What is Reproducibility · Reproducibility is the ability of an entire analysis of an experiment or...
Jian‐Liang (Jason) Li, Ph.D.
Director, Integrative Bioinformatics Group
National Institute of Environmental Health Sciences
Towards Reproducible Cancer Genomics
Implementing Rigor and Reproducibility in Bioinformatics Group
What is Reproducibility
Wikipedia
Reproducibility is the ability of an entire analysis of an experiment or study to be duplicated, either by the same researcher or by someone else working independently, whereas reproducing an experiment is called replicating it.
Reproducibility for Computational Analysis
An analysis is described or captured in sufficient detail that it can be precisely reproduced (James Taylor)
Is There a Reproducibility Crisis?
90% of surveyed researchers agreed that such a crisis exists
More than 70% of researchers have tried and failed to reproduce another scientist's experiments, and more than half have failed to reproduce their own experiments
Baker M Nature 2016
Reproducibility Movement
New NIH Requirements for Grant Proposals
Rigor and Reproducibility
Why Should We Implement Rigor and Reproducibility in Bioinformatics Group
Share responsibility
Science has rapidly become data intensive, and all biology is computational biology now
Bioinformatics groups play an important role in current biomedical research
More and more research groups rely on bioinformaticians or bioinformatics groups for their data analysis and interpretation
Research Project Workflow
Project Conception
Experimental Design
Data Analysis
Data Interpretation
Project Conclusions
Some published data analyses are not reproducible:
Analysis might not be performed as described
Missing software, version, parameters, data, etc.
Research Project Workflow
Project Conception
Experimental Design
Data Analysis
Data Interpretation
Project Conclusions
Data Sharing
Embedded Model
Enhance Reproducibility for Scientific Study
Experimental Design
Partner with the Genomics core to help with design questions
Exploratory study and confirmatory study
Controls; Sample Size; Coverage; Randomization
Standardized project submission form; consultation with Genomics and Bioinformatics staff is required before samples are submitted
Background; Hypothesis; Experimental Design; Data Analysis Strategy
NGS project review committee
Data Analysis
Quality Control
Establish Standard Operating Procedures
Automate routine analyses
Scenario in Bioinformatics Group
Most projects come back months or years after the initial data were delivered
Could not remember how a particular set of files was generated
Could not remember what parameters were used
Murphy’s Law: Everything you do, you will probably have to do over again
Re‐run analysis with outliers removed
Test different data sets
Cross‐site collaborative/team science projects
Discrepancies among different sites or groups
OS, software versions, libraries, dependencies, genome annotations, etc.
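The cross‐site discrepancies described in the scenario above can be tracked down faster if each site records its environment in a comparable form. The following is a minimal sketch, not a description of any group's actual tooling: the function names and the example package list are hypothetical, and it covers only the OS, the Python version, and installed Python packages (reference genome and annotation versions would need to be recorded separately).

```python
import platform
import sys
from importlib import metadata


def snapshot_environment(packages):
    """Return a dict describing the OS, Python, and selected package versions."""
    record = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            record[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            record[name] = "not installed"
    return record


def diff_environments(site_a, site_b):
    """Return {key: (value_a, value_b)} for every key whose values differ."""
    keys = sorted(set(site_a) | set(site_b))
    return {
        key: (site_a.get(key, "missing"), site_b.get(key, "missing"))
        for key in keys
        if site_a.get(key) != site_b.get(key)
    }


if __name__ == "__main__":
    # The package names here are examples only.
    here = snapshot_environment(["numpy", "pandas"])
    # Hypothetical record from a collaborating site, for illustration only.
    there = dict(here, python="0.0.0")
    for key, (a, b) in diff_environments(here, there).items():
        print(f"{key}: {a} != {b}")
```

Saving the snapshot next to each set of results means a later comparison between sites starts from recorded facts rather than from memory.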
Benefit from Reproducible Computational Research
Increase efficiency
Numbers and figures can be easily updated when the data change
Easy to look up results and put them in a manuscript
Enable continuity
Reproduce results generated months before
Transfer projects among different staff members
Enhance collaboration
Data and script sharing
10 Simple Rules for Reproducible Computational Research
1. For Every Result, Keep Track of How It Was Produced
2. Avoid Manual Data Manipulation Steps
3. Archive the Exact Versions of All External Programs Used
4. Version Control All Custom Scripts
5. Record All Intermediate Results, When Possible in Standardized Formats
6. For Analyses That Include Randomness, Note Underlying Random Seeds
7. Always Store Raw Data behind Plots
8. Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
9. Connect Textual Statements to Underlying Results
10. Provide Public Access to Scripts, Runs, and Results
Sandve GK, et al PLoS Comput Biol. 2013
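Rules 1 and 6 can be illustrated with a small sketch that returns a provenance record alongside every result. This is one possible approach under stated assumptions, not a method prescribed by Sandve et al.: the function name, the record fields, and the bootstrap analysis itself are hypothetical and chosen only to give the random seed something to control.

```python
import hashlib
import json
import platform
import random
from datetime import datetime, timezone


def run_with_provenance(values, n_resamples, seed):
    """Bootstrap the mean of `values` and return (result, provenance record)."""
    rng = random.Random(seed)  # Rule 6: a seeded generator makes reruns match
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(values) for _ in values]
        means.append(sum(sample) / len(sample))
    result = sum(means) / len(means)

    # Rule 1: record how the result was produced, next to the result itself.
    provenance = {
        "analysis": "bootstrap_mean",
        "parameters": {"n_resamples": n_resamples, "seed": seed},
        "input_sha256": hashlib.sha256(json.dumps(values).encode()).hexdigest(),
        "python_version": platform.python_version(),
        "run_at": datetime.now(timezone.utc).isoformat(),
        "result": result,
    }
    return result, provenance


if __name__ == "__main__":
    estimate, record = run_with_provenance([4.1, 5.3, 4.8, 6.0], 200, seed=2016)
    print(json.dumps(record, indent=2))
```

Writing the record to a file beside the output addresses the "could not remember what parameters were used" problem directly: the parameters, seed, and input checksum travel with the result.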
Record Everything
Automate Everything
Strategies for Reproducible Computational Research
Narrative descriptions are a simple but valuable way to support computational reproducibility.
Custom scripts and code can automate research analysis
Software frameworks enable easier handling of software dependencies
Literate programming combines narratives with code
Workflow management systems enable software to be executed via a graphical user interface
Virtual machines encapsulate an entire operating system and software dependencies
Software containers ease the process of installing and configuring dependencies
Piccolo and Frampton Gigascience 2016
Establish Reproducible Environment
Centralized open source programs
Versions of packages and/or libraries
Centralized reference genomes and annotations
Consistent file and directory organization, with logical names
PI folder, Project folder
‐ analysis sub‐folders: code, data, results, etc.
Clear documentation
Shared code and scripts
Ideally, use tools to recreate or preserve the environment
Version control with Git
Docker
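A consistent directory layout is easiest to maintain when it is created by a script rather than by hand. The sketch below assumes the nesting is PI folder, then project folder, then code, data, and results sub‐folders; that nesting, the function name, and the example PI and project names are assumptions for illustration, not the group's actual convention.

```python
import tempfile
from pathlib import Path

# Sub-folder names taken from the slide; the nesting itself is an assumption.
SUBFOLDERS = ("code", "data", "results")


def create_project(root, pi_name, project_name, subfolders=SUBFOLDERS):
    """Create <root>/<pi_name>/<project_name>/ with the standard sub-folders."""
    project_dir = Path(root) / pi_name / project_name
    project_dir.mkdir(parents=True, exist_ok=True)
    for name in subfolders:
        (project_dir / name).mkdir(exist_ok=True)
    readme = project_dir / "README.txt"
    if not readme.exists():  # never overwrite existing documentation
        readme.write_text(f"Project: {project_name}\nPI: {pi_name}\n")
    return project_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical PI and project names.
        project = create_project(tmp, "pi_example", "rnaseq_pilot")
        for path in sorted(project.iterdir()):
            print(path.relative_to(tmp))
```

The function is safe to run repeatedly, so it can be called at the start of every analysis script without disturbing existing files.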
Reproducible Report
Use R Markdown/knitr or Python IPython/Jupyter notebooks to generate reproducible reports
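The idea behind a reproducible report is that every number in the text is computed from the data when the report is generated, so nothing is typed in by hand. The sketch below shows that idea in plain Python using only the standard library; it is a stand‐in for illustration and not a substitute for R Markdown or Jupyter, and the report wording, function name, and coverage values are hypothetical.

```python
import statistics
from string import Template

# Report text with placeholders; the numbers are filled in from the data.
REPORT = Template(
    "Samples analyzed: $n\n"
    "Mean coverage: $mean\n"
    "Median coverage: $median\n"
)


def render_report(coverage):
    """Return the report text with all figures computed from `coverage`."""
    return REPORT.substitute(
        n=len(coverage),
        mean=f"{statistics.mean(coverage):.1f}",
        median=f"{statistics.median(coverage):.1f}",
    )


if __name__ == "__main__":
    coverage = [30, 32, 34, 95]      # hypothetical per-sample coverage values
    print(render_report(coverage))
    print(render_report(coverage[:-1]))  # rerun with the outlier removed
```

Rerunning with an outlier removed regenerates every figure in the text at once, which is the efficiency benefit that notebook and knitr workflows provide at larger scale.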
Data Sharing
Data Commons
E‐Lab: social network environment for collaborative projects
Adopting a Prevention Approach
Ensure reproducibility and replicability through a more preventative approach that greatly expands data analysis education and routinely uses software tools.
Leek and Peng PNAS 2015
The Challenges
P‐hacking
Alteration of data to align with hypotheses
Selective Reporting
HARKing (Hypothesizing After the Results are Known)
Alteration of hypotheses to align with data
Low statistical power
Openness and transparency
Method and data sharing
Experimental Design
Sample size
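Low statistical power and sample size are linked: for a fixed effect size, power is set largely by the number of samples per group. As a rough illustration, the sketch below uses the standard normal‐approximation formula for comparing two group means, n = 2(z_{1-α/2} + z_{power})² / d², where d is the standardized effect size. This is a textbook approximation chosen for illustration, not the method used in the talk, and the function name is hypothetical; a t‐test‐based calculation gives slightly larger values.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate samples per group for a two-sided, two-group comparison.

    effect_size is the standardized difference in means (Cohen's d).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for target power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


if __name__ == "__main__":
    for d in (1.0, 0.5, 0.2):
        print(f"d = {d}: {samples_per_group(d)} samples per group")
```

Halving the effect size roughly quadruples the required sample size, which is why small studies chasing subtle effects tend to be underpowered.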
The Biggest Challenge - Incentives
Good for Science
Rigor
Quality
Reproducibility
Good for PI/Institute
Publication
Quantity
Novelty
Find the balance between pushing reproducible research and keeping up with leading‐edge techniques and developments
Modified from www.jimgrange.wordpress.com
Acknowledgements
NIEHS – IBG
Grimm Sara
Bennett Brian
Burkholder Adam
Klimczak Les
Lavender Andy
Li Jianying
Papas Brian
Randall Tom
Ward James
Wang Ty
Xu Xiaojiang
SBP – Orlando
Ranjan Perera
Feng Qi
Stacy Huang
SBP – La Jolla
Andrei Osterman
Alexey Eroshkin
Roy Williams
Ally Perlina
SBP – IT
David Huhta
Derek Roberts