Statistical challenges for reproducibility in neuro-imaging
Cyril Pernet, University of Edinburgh, UK
[email protected] · @CyrilRPernet
Nature 533, 452–454 (26 May 2016)
“We want to be discovering new things but not generating too many false leads.” M. Munafo
The scientific method (Popper)
Hypothesis
Study design & Prediction
Data collection
Data analysis
Interpretation for/against prediction
Publication
What to expect from human data?
• ‘The inherent variability in biological systems means there is no expectation that results will necessarily be precisely replicated.’
• The issue is that main results / conclusions don’t replicate, because of:
experimental design and power
cognitive bias
statistical analysis
Begley & Ioannidis (2015) Circ Research 116
Common failure points in the cycle: low statistical power, lack of quality control
Power analyses
• One of the main issues in reproducibility is lack of power – a significant effect obtained with a small sample is likely to be driven by noise.
• Power is the probability of rejecting H0 (e.g. at p < 0.05) when there is a true effect to be detected (1 − β; your type II error goes down).
• It depends (mainly) on your threshold alpha, the sample size and the effect size.
• Remember: a published effect from an under-powered study is likely bigger than the true effect in the population.
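As a back-of-the-envelope illustration of how alpha, sample size and effect size interact (a sketch using statsmodels, not one of the neuroimaging tools cited below; the effect size of 0.5 is an arbitrary example value):

```python
# A priori power calculation for a two-sample design with statsmodels.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size needed per group to detect a medium effect (Cohen's d = 0.5)
# at alpha = 0.05 with 80% power.
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"required n per group: {n:.1f}")

# Conversely: the power actually achieved with only 15 subjects per group.
p = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=15)
print(f"power with n = 15: {p:.2f}")
```

Running the numbers this way before data collection is exactly what the tools below do for fMRI, with the extra complication of ROI-level variance estimates.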
http://fmripower.org/
Can account for the two-level modelling of fMRI (within- and between-subject variance)
Works on ROIs
Needs pilot data
For a given sample size, shows power per ROI (uncorrected)
For a given power and ROI, shows the power curve. Winner’s curse! An a priori ROI is better; here the effect is over-estimated
http://neuropowertools.org/
• Peak-wise and cluster-wise analysis using a cluster forming threshold
• Any stat map can be analysed!
• Estimates straight from NeuroVault! Lots of data available (see Chris Gorgolewski)
Post-hoc power
• Study 1 and study 2 cannot reject H0, and post-hoc power is higher in study 1 – does that mean stronger support for H0?
No: p-values are P(data | H0), not P(H0)
Power approach paradox: study 1 is actually closer to rejecting H0 (power and p-value are related)
Post-hoc power is not to be used to interpret null results
[Figure: one-tailed z-test distributions for study 1 and study 2]
Hoenig & Heisey (2001) Am Stat 55
Statistical analysis issues
• Exploratory data analysis (EDA)
• Usually no inference – the goal is to have pilot data, look at them and generate hypotheses
• Confirmatory data analysis (CDA): test hypotheses (most often ROI-based)
• Finding the balance between EDA and CDA cycles is key – EDA too often consists of null-hypothesis testing without hypotheses
• This leads to p-hacking and cognitive biases
• EDA should be the exploration of a set of confirmatory analyses
p-hacking
• Researchers’ degrees of freedom: which observations to record, which ones to compare, which factors to control and test for.
• P-hacking: try different things until you get the result you’re looking for. Not necessarily bad intentions or fraud! Simply an effect of cognitive biases.
Texas sharpshooter effect: firing off a few rounds and then drawing a bull’s-eye around the bullet holes; trying analysis techniques until one starts showing what we expect (needs blinding).
Confirmation bias: carefully debugging analyses and debunking data that counter a favoured hypothesis, while letting evidence in favour of the hypothesis slide by unexamined (needs blinding and strong inference).
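A small simulation (my illustration, assuming a simple one-sample design) shows how one researcher degree of freedom – optional stopping, i.e. testing repeatedly while collecting data and stopping as soon as p < 0.05 – inflates the false-positive rate even when H0 is true:

```python
# Optional stopping: test after every batch of subjects, stop at p < 0.05.
# Under H0 the nominal rate should be 5%; repeated looks inflate it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed, as the talk recommends

def phacked_experiment(n_start=10, n_max=60, step=5, alpha=0.05):
    """Collect data under H0 (mean 0), testing after every batch."""
    data = list(rng.standard_normal(n_start))
    while True:
        p = stats.ttest_1samp(data, 0).pvalue
        if p < alpha:
            return True          # "significant" -- a false positive
        if len(data) >= n_max:
            return False
        data.extend(rng.standard_normal(step))

n_sims = 2000
false_positives = sum(phacked_experiment() for _ in range(n_sims))
print(f"false-positive rate: {false_positives / n_sims:.3f}")  # typically well above 0.05
```

Pre-registering the sample size and stopping rule (point 1 of the checklist below) removes exactly this inflation.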
BLINDING
• On the collection / source of data: e.g. shielding both patients and clinicians from knowing who receives an experimental drug or a placebo
• On the analysis: methodically perturb data values and/or labels and develop the analysis pipeline, ensuring all is fine
• “For example, a plausible, although still blinded, graph may lead the researcher to ask whether a sample explores the full range of an independent variable, or it might trigger a revisiting, before unblinding, of the scaling of one of the variables. Another graph might suggest that the whole effect is driven by a single outlier point, and suggest that the researcher needs more data, again before unblinding”
MacCoun & Perlmutter (2015) Nature 526
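A minimal sketch of label blinding (an illustration, not MacCoun & Perlmutter’s own procedure): shuffle the group labels once, develop and debug the entire pipeline on the blinded data, and only unblind once the analysis is frozen.

```python
# Label blinding: develop the pipeline without knowing which subject
# belongs to which group; the permutation is kept aside for unblinding.
import numpy as np

rng = np.random.default_rng(2016)

def blind_labels(labels, rng):
    """Return shuffled labels plus the permutation needed to undo them."""
    perm = rng.permutation(len(labels))
    return [labels[i] for i in perm], perm

true_labels = ["patient"] * 4 + ["control"] * 4
blinded, perm = blind_labels(true_labels, rng)

# ... develop and debug the entire analysis pipeline on `blinded` ...

# Unblinding: recover the original assignment only once the pipeline is frozen.
unblinded = [None] * len(blinded)
for pos, lab in zip(perm, blinded):
    unblinded[pos] = lab
assert unblinded == true_labels
```

As the take-home message below notes, this can be as simple as asking a colleague to hold the permutation.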
1 – plan data sample and stopping rule (power / sequential analysis)
2 – report all variables recorded and conditions tested (selective reporting / bias)
3 – report data before/after cleaning and choice of cleaning rules (confirmation bias)
4 – report all analyses and the effect of each variable, in particular results with/without covariates (Texas sharpshooter effect & confirmation bias)
(see COBIDAS checklist – Tonya White)
Stability / Variability of results
• Regions show stable results with increased effect sizes
• Effect sizes and precise locations could vary a lot (up to 8 z units)
• This leads to large differences once looking at thresholded maps (same as in psychology: comparing effect sizes vs comparing sig. vs non-sig. results)
You must declare all analyses performed (e.g. when modelling using one basis set we don’t see any effect, but it works using another set) and share the (effect size) maps, because peaks from thresholded maps aren’t stable.
Software usage
Clicking on a GUI is not reproducible
1) keep track of every analysis step (SPM batch, Nipype workflow, AFNI and FSL scripts)
2) no manual editing (use commands to copy, paste, change names, etc.)
3) save seeds of random number generators
4) share as supplementary material and/or in a repository
Sandve et al. 2013 Plos Comp Bio 9
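Point 3 above can be as simple as recording the seed alongside the results (a minimal Python sketch, assuming NumPy’s Generator API; the permutation distribution is a stand-in for any stochastic step):

```python
# Store the RNG seed with the analysis outputs so any stochastic step
# (e.g. a permutation test) can be re-run exactly.
import numpy as np

provenance = {"seed": 20160526, "n_permutations": 1000}

rng = np.random.default_rng(provenance["seed"])
perm_null = rng.standard_normal(provenance["n_permutations"])

# Anyone with the provenance record reproduces the identical distribution.
rng_replay = np.random.default_rng(provenance["seed"])
assert np.array_equal(perm_null,
                      rng_replay.standard_normal(provenance["n_permutations"]))
```

In practice the provenance record would be written out next to the result files and shared with them.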
Software environment
• Should we set up a virtual machine and take snapshots? If you use NeuroDebian, simply refer to the running version (people can recover that version at any time). In all cases, report your OS and software versions.
• Invest in using version control (e.g. https://github.com/)
Improving Code
• Use literate programming! The code should tell a story about:
what is happening to the data at each step
which hypotheses are being tested
how a given computation has been tested / validated
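A toy example of what such a literate analysis step might look like (the data values and the hypothesis are invented for illustration):

```python
# Literate-style analysis step: the comments tell the story -- what
# happens to the data, what hypothesis is tested, how it was validated.
import numpy as np
from scipy import stats

# Step 1: load the per-subject contrast values for the a priori ROI
# (hard-coded here for the sketch; in practice read from pipeline output).
roi_betas = np.array([0.8, 1.1, 0.6, 1.4, 0.9, 1.2, 0.7, 1.0])

# Step 2: hypothesis under test -- mean ROI activation is positive
# (one-sample t-test against zero, as pre-registered).
t, p = stats.ttest_1samp(roi_betas, 0.0, alternative="greater")

# Step 3: validation -- the t statistic matches the textbook formula,
# checked against scipy's implementation.
t_manual = roi_betas.mean() / (roi_betas.std(ddof=1) / np.sqrt(len(roi_betas)))
assert np.isclose(t, t_manual)

print(f"t = {t:.2f}, one-tailed p = {p:.4f}")
```

The workflow engines below take this idea further by making every step of the pipeline an explicit, re-runnable node.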
Use workflow engines (see Pierre Bellec)
PSOM http://psom.simexp-lab.org/
AA https://github.com/rhodricusack/automaticanalysis/wiki
Nipype http://nipy.org/nipype/
TAKE HOME MESSAGE
• Plan study with enough statistical power
• Avoid bias using preregistration and/or blinding (ask a friend, your boss, etc to blind labels in the experiment and/or groups – this can be that simple)
• Be diligent during reporting (all techniques, methods, thresholds)
• Focus on effect sizes – share raw maps
• Share your data and code – this is the real science not the paper