What is Reproducibility · Reproducibility is the ability of an entire analysis of an experiment or...

Jian-Liang (Jason) Li, Ph.D., Director, Integrative Bioinformatics Group, National Institute of Environmental Health Sciences. Towards Reproducible Cancer Genomics. Implementing Rigor and Reproducibility in Bioinformatics Group.


Page 1

Jian-Liang (Jason) Li, Ph.D.
Director, Integrative Bioinformatics Group
National Institute of Environmental Health Sciences

Towards Reproducible Cancer Genomics
Implementing Rigor and Reproducibility in Bioinformatics Group

Page 2

What is Reproducibility

Wikipedia: Reproducibility is the ability of an entire analysis of an experiment or study to be duplicated, either by the same researcher or by someone else working independently, whereas reproducing an experiment is called replicating it.

Reproducibility for Computational Analysis: An analysis is described or captured in sufficient detail that it can be precisely reproduced (James Taylor)

Page 3

Is There a Reproducibility Crisis?

90 percent of researchers surveyed agreed that such a crisis exists

More than 70% of researchers have tried and failed to reproduce another scientist's experiments, and more than half have failed to reproduce their own experiments

Baker M Nature 2016

Page 4

Reproducibility Movement

New NIH Requirements for Grant Proposals: Rigor and Reproducibility

Page 5

Why Should We Implement Rigor and Reproducibility in Bioinformatics Group

Shared responsibility

- Science has rapidly become data intensive, and all biology is computational biology now
- Bioinformatics groups play an important role in current biomedical research
- More and more research groups rely on bioinformaticians or bioinformatics groups for their data analysis and interpretation

Page 6

Research Project Workflow

Project Conception -> Experimental Design -> Data Analysis -> Data Interpretation -> Project Conclusions

Some published data analyses are not reproducible:
- Analysis might not be performed as described
- Missing software, version, parameters, data, etc.

Page 7

Research Project Workflow

Project Conception -> Experimental Design -> Data Analysis -> Data Interpretation -> Project Conclusions

Data Sharing

Embedded Model: Enhance Reproducibility for Scientific Study

Page 8

Experimental Design

Partner with Genomics core to help with design questions
- Exploratory study and confirmatory study
- Controls; Sample Size; Coverage; Randomization

Standardized project submission form; consultation with Genomics and Bioinformatics staff is required before submitting samples
- Background; Hypothesis; Experimental design; Data Analysis Strategy

NGS project review committee
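
As an illustration of the randomization item above, here is a minimal Python sketch of a seeded, balanced assignment of samples to groups. The sample IDs, group names, seed value, and the function name `randomize_samples` are all hypothetical; the slides do not prescribe a particular procedure.

```python
import random

def randomize_samples(sample_ids, groups, seed):
    """Assign samples to groups in a balanced, seeded random order."""
    rng = random.Random(seed)        # local generator; the seed makes the assignment repeatable
    shuffled = sorted(sample_ids)    # sort first so the input order does not affect the result
    rng.shuffle(shuffled)
    # Deal the shuffled samples round-robin so group sizes differ by at most one
    return {sid: groups[i % len(groups)] for i, sid in enumerate(shuffled)}

samples = [f"S{i:02d}" for i in range(1, 13)]   # hypothetical sample IDs
assignment = randomize_samples(samples, ["control", "treated"], seed=20240101)
for sid in sorted(assignment):
    print(sid, assignment[sid])
```

Writing the seed on the project submission form would let anyone regenerate the same assignment later.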

Page 9

Data Analysis

Quality Control
Establish Standard Operating Procedures
Automate routine analyses

Page 10

Scenario in Bioinformatics Group

Most projects come back months or years after the initial data were delivered
- Could not remember how a particular set of files was generated
- Could not remember what parameters were used

Murphy’s Law: Everything you do, you will probably have to do over again
- Re-run analysis with outliers removed
- Test different data sets

Cross-site collaborative/team science projects
- Discrepancies among different sites or groups
- OS, software versions, libraries, dependencies, genome annotations, etc.
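
One way to guard against the forgotten-parameters scenario is to save a small run record next to every set of output files. The Python sketch below is an assumption about how that could look, not the group's actual practice; the function name `build_run_record`, the parameter names, the input path, and the package list are hypothetical.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata

def build_run_record(parameters, input_files, packages):
    """Collect what is needed to repeat a run later: parameters, inputs, environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None    # not installed; recorded explicitly rather than skipped
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "command": sys.argv,
        "python": platform.python_version(),
        "os": platform.platform(),
        "parameters": parameters,
        "input_files": input_files,
        "package_versions": versions,
    }

record = build_run_record(
    parameters={"min_mapq": 20, "remove_outliers": True},   # hypothetical parameters
    input_files=["data/sample01.fastq.gz"],                  # hypothetical path
    packages=["numpy", "pandas"],                            # hypothetical dependencies
)
print(json.dumps(record, indent=2))
```

Comparing two such records from different sites is a quick first check when cross-site results disagree.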

Page 11

Benefit from Reproducible Computational Research

Increase efficiency
- Numbers and figures can be easily updated when the data change
- Easy to look up results and put them in a manuscript

Enable continuity
- Reproduce the results generated months before
- Transfer projects among different staff members

Enhance collaboration
- Data and script sharing

Page 12

10 Simple Rules for Reproducible Computational Research

1. For Every Result, Keep Track of How It Was Produced
2. Avoid Manual Data Manipulation Steps
3. Archive the Exact Versions of All External Programs Used
4. Version Control All Custom Scripts
5. Record All Intermediate Results, When Possible in Standardized Formats
6. For Analyses That Include Randomness, Note Underlying Random Seeds
7. Always Store Raw Data behind Plots
8. Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
9. Connect Textual Statements to Underlying Results
10. Provide Public Access to Scripts, Runs, and Results

Sandve GK, et al. PLoS Comput Biol. 2013

Record Everything

Automate Everything
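
Rules 6 and 7 can be sketched in a few lines of Python: the random seed is passed explicitly, and the numbers behind a figure are written out together with that seed. The measurements, the seed, and the function names are made up for illustration, and an in-memory buffer stands in for a file saved next to the plot.

```python
import csv
import io
import random
import statistics

def bootstrap_means(values, n_resamples, seed):
    """Return bootstrap means of `values`; the same seed gives the same list."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    ]

def write_plot_data(handle, means, seed):
    """Save the numbers behind a plot, with the seed noted in a leading comment line."""
    handle.write(f"# seed={seed}\n")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["resample", "mean"])
    for i, m in enumerate(means):
        writer.writerow([i, repr(m)])   # repr keeps full floating-point precision

expression = [5.1, 4.8, 6.3, 5.9, 5.5, 4.9]   # made-up measurements
seed = 12345
means = bootstrap_means(expression, n_resamples=5, seed=seed)
buffer = io.StringIO()                         # stand-in for a file next to the figure
write_plot_data(buffer, means, seed)
print(buffer.getvalue())
```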

Page 13

Strategies for Reproducible Computational Research

Narrative descriptions are a simple but valuable way to support computational reproducibility

Custom scripts and code can automate research analysis

Software frameworks enable easier handling of software dependencies

Literate programming combines narratives with code

Workflow management systems enable software to be executed via a graphical user interface

Virtual machines encapsulate an entire operating system and software dependencies

Software containers ease the process of installing and configuring dependencies

Piccolo and Frampton  Gigascience 2016

Page 14

Establish Reproducible Environment

Centralized open source programs
- Versions of packages and/or libraries

Centralized reference genomes and annotations

Consistent file and directory organization, with logical names
- PI folder
  - Project folder
    - analysis sub-folder; code, data, results, etc.

Clear documentation

Share the code and scripts

Ideally, use tools to recreate or preserve the environment
- Version control with Git
- Docker
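
The directory convention above (PI folder, project folder, analysis sub-folder with code, data, and results) can be created by a script so that every project starts out the same way. This is a minimal Python sketch; the folder names passed in, the function name `create_analysis_tree`, and the `README.txt` file are assumptions, and a temporary directory stands in for the shared storage root.

```python
from pathlib import Path
import tempfile

SUBFOLDERS = ("code", "data", "results")   # taken from the slide; extend as needed

def create_analysis_tree(root, pi, project, analysis):
    """Create <root>/<pi>/<project>/<analysis>/{code,data,results} and return the analysis path."""
    analysis_dir = Path(root) / pi / project / analysis
    for name in SUBFOLDERS:
        (analysis_dir / name).mkdir(parents=True, exist_ok=True)
    readme = analysis_dir / "README.txt"   # assumed place for the documentation
    if not readme.exists():
        readme.write_text(f"PI: {pi}\nProject: {project}\nAnalysis: {analysis}\n")
    return analysis_dir

with tempfile.TemporaryDirectory() as root:   # stand-in for the shared storage root
    path = create_analysis_tree(root, "smith_lab", "rnaseq_2016", "differential_expression")
    for p in sorted(path.iterdir()):
        print(p.relative_to(root))
```

Because `mkdir` is called with `exist_ok=True`, re-running the script on an existing project leaves its contents untouched.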

Page 15

Reproducible Report

Use R Markdown/knitr or Python IPython/Jupyter notebooks to generate reproducible reports

Page 16

Data Sharing

Data Commons
E-Lab: social network environment for collaborative projects

Page 17

Adopting a Prevention Approach

Ensure reproducibility and replicability by adopting a more preventative approach that greatly expands data analysis education and routinely uses software tools.

Leek and Peng PNAS 2015

Page 18

The Challenges

P-hacking
- Alteration of data to align with hypotheses

Selective reporting

HARKing (Hypothesizing After the Results are Known)
- Alteration of hypotheses to align with data

Low statistical power

Openness and transparency
- Method and data sharing

Experimental design
- Sample size
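
A short calculation shows why running many tests and reporting only the significant ones is a problem. Assuming independent tests in which every null hypothesis is true, the chance of at least one p-value below a threshold alpha is 1 - (1 - alpha)^n. The Python sketch below evaluates this; the function name and the chosen numbers of tests are illustrative only, and real analyses rarely have fully independent tests.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """P(at least one p < alpha) across n independent tests when every null is true."""
    return 1.0 - (1.0 - alpha) ** n_tests

for n in (1, 5, 20):
    print(n, round(chance_of_false_positive(n), 3))
```

With alpha = 0.05, twenty such tests give roughly a 64% chance of at least one nominally significant result.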

Page 19

The Biggest Challenge - Incentives

Good for Science
- Rigor
- Quality
- Reproducibility

Good for PI/Institute
- Publication
- Quantity
- Novelty

Find the balance between pushing reproducible research and keeping up with leading-edge techniques and developments

Modified from www.jimgrange.wordpress.com

Page 20

Acknowledgements

NIEHS – IBG
Grimm Sara; Bennett Brian; Burkholder Adam; Klimczak Les; Lavender Andy; Li Jianying; Papas Brian; Randall Tom; Ward James; Wang Ty; Xu Xiaojiang

SBP – Orlando
Ranjan Perera; Feng Qi; Stacy Huang

SBP – La Jolla
Andrei Osterman; Alexey Eroshkin; Roy Williams; Ally Perlina

SBP – IT
David Huhta; Derek Roberts