Docker in Open Science Data Analysis Challenges
Bruce Hoff, Principal Software Engineer, Sage Bionetworks
Open Science in Disease Research
Containerization as a tool for scientific reproducibility
Case Study: Docker in the 2015 ALS Stratification Challenge
Case Study: Docker in the 2016 Digital Mammography Challenge
Open Issues and Lessons Learned
Agenda
This talk is about saving lives.
Disease research is data intensive…
… but published analyses often aren’t reproducible.
… and valuable data sets aren’t shared freely.
… which reduces the rate of progress.
Difficulties in science validation
• Amgen scientists tried to confirm 53 landmark papers in pre-clinical oncology research: only 6 (11%) were confirmed. [1]
• Bayer HealthCare reported that only about 25% of published preclinical studies could be validated. [2]
• Potti Gate: genomics research at Duke during 2006-2010 led to the identification of diagnostic signatures that spurred clinical trials. The research was later deemed statistically flawed and the clinical trials were stopped.
[1] C. Glenn Begley and Lee M. Ellis, Nature 483, 531 (2012)
[2] F. Prinz, T. Schlange, and K. Asadullah, Nature Rev. Drug Discov. 10, 712 (2011)
Our Solution: Open Data Analysis Challenges
Engage the community, rather than a select company or lab, to solve a problem in biological/medicinal research.
Obtain and expose a high value data set that would otherwise be accessible by a few.
Require that participants share their code and document their algorithms; test for reproducibility.
[Figure: bar chart of unique final submissions (number of submitting teams) per challenge round, DREAM 2 through DREAM9.5]
A decade of participation
Measures of Impact
• 32 scientific challenges
• 50 partner institutions (since 2006)
• >5000 registered users
• 10 international conferences
• 2500 conference attendees
• >100 publications using DREAM data
• 25 journal articles
• 3 journal special issues
• 2 edited books
• 1,300 citations
• 20 PhD theses
• Use of Challenges in classrooms as problem sets
Dialogue for Reverse Engineering Assessment and Methods (DREAM) is a crowdsourcing effort that poses quantitative challenges about systems biology modeling.
Sage Bionetworks (2009-) is a nonprofit biomedical research organization seeking to accelerate biomedical research through open systems, incentives, and standards.
The two organizations merged in 2013 to drive a continuing series of open science challenges.
The Organization
• Web services that facilitate collaborative web science
– Projects Sharing Resources (code, files, ideas)
– Wiki narratives
• Analysis provenance: linking data, code, and results; data versioning
• Web services that facilitate Challenge logistics
– Registration, acceptance of data usage, acceptance of Challenge Terms and Conditions
– Real-time challenge leaderboards
– Discussion Forum
– Formation of Teams
– Online Supplement for Challenge Papers, e.g.: https://www.synapse.org/#!Synapse:syn2528824/wiki/
Synapse: enabling collaborative research
2015 ALS Challenge: a case study in using Docker in a DREAM challenge
ALS is a rapidly progressing neurodegenerative disease that typically leads to death within 3-5 years but for which disease progression is heterogeneous across the patient population.
Data for 9000 ALS patients provided by the Pooled Resources Open-Access Clinical Trial (PRO-ACT) database.
The challenge was to predict disease progression from clinical data.
$28,000 in prize money raised through a grass-roots fund drive: https://www.indiegogo.com/projects/fund-the-prize-solve-als-together
Nature Biotechnology agreed to publish the results.
In a typical challenge…
• Data is partitioned into
– training
– leaderboard
– validation
• Participants
– download training data
– apply statistical learning methods
– submit predictions
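The three-way partition above can be sketched in a few lines of Python. This is a minimal illustration only: the 60/20/20 percentages, the fixed seed, and the function name `partition_records` are assumptions made for the example, not the proportions or tooling used in any actual challenge.

```python
import random


def partition_records(record_ids, percents=(60, 20, 20), seed=0):
    """Split record IDs into training, leaderboard, and validation sets.

    The percentages and the seed are illustrative assumptions.
    """
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")
    shuffled = list(record_ids)
    # A seeded generator makes the split repeatable across runs.
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * percents[0] // 100
    n_leader = len(shuffled) * percents[1] // 100
    return {
        "training": shuffled[:n_train],
        "leaderboard": shuffled[n_train:n_train + n_leader],
        # Validation takes whatever remains, so no record is dropped.
        "validation": shuffled[n_train + n_leader:],
    }
```

Participants would only ever receive the training portion; the leaderboard and validation portions stay with the organizers for scoring.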
Organizers want to constrain submitted models to work in a certain way:
• Model has a ‘selector’ component to select predictive clinical features
• Model has a ‘predictor’ component to predict ALS outcome based on selected features.
Organizers want to run each model themselves to:
• Ensure models are structured as prescribed
• Ensure reproducibility of output
Docker to the rescue!
[Diagram: Clinical Data → Selector → Selected Features → Predictor → Model Output]
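The selector/predictor structure can be sketched as a two-stage pipeline in which the predictor only ever sees the features the selector chose. Everything named here (`run_model`, the toy selector and predictor, the dictionary record layout) is hypothetical; in particular, whether selection happens once for the whole data set or separately per patient is an assumption of this sketch.

```python
from typing import Callable, Dict, List

Patient = Dict[str, float]  # hypothetical record: feature name -> value


def run_model(
    patients: List[Patient],
    selector: Callable[[List[Patient]], List[str]],
    predictor: Callable[[Dict[str, float]], float],
) -> List[float]:
    """Select features once, then predict for each patient from those features only."""
    selected = selector(patients)
    predictions = []
    for patient in patients:
        # Strip every feature the selector did not choose before predicting.
        restricted = {name: patient[name] for name in selected}
        predictions.append(predictor(restricted))
    return predictions


# Toy stand-ins for a participant's components (illustrative only).
def first_two_features(patients: List[Patient]) -> List[str]:
    return sorted(patients[0])[:2]


def mean_of_features(features: Dict[str, float]) -> float:
    return sum(features.values()) / len(features)
```

Running the harness on the organizers' side, rather than trusting submitted predictions, is what lets them check that a model really is structured this way.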
Scientific Leadership
High value data set
IT Resources
Prize Money
High visibility Publication
Community participation
The ‘Stone Soup’ of Open Challenges
IBM Cloud with a zEC12 system virtual machine running a Linux server with 32 processors, 240 GB memory and 9 TB storage space.
IBM Donates a Mainframe for ALS Challenge
Provision a container on a unique port for each participant. They log in as:
> ssh [email protected] -p port_number
Provide a script that sends a “signal” to a process running Docker:
> create_model_snapshot
Back-end process runs “docker commit” to create a copy of the model for scoring.
Back-end reruns captured image as a new container, after mounting leaderboard (or later, validation) data volume.
Using Docker with a Mainframe
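The snapshot-and-rescore steps can be sketched as two Docker CLI invocations. The helper names, the image tag, the host path, and the `/data` mount point below are assumptions for illustration; the slides do not show the actual back-end scripts. The sketch only builds the argument lists, so it can be inspected without a Docker host.

```python
from typing import List


def commit_command(container_id: str, image_tag: str) -> List[str]:
    """Build the `docker commit` call that snapshots a participant's container."""
    return ["docker", "commit", container_id, image_tag]


def scoring_command(image_tag: str, host_data_dir: str,
                    mount_point: str = "/data") -> List[str]:
    """Build the `docker run` call that reruns the snapshot with data mounted read-only.

    The mount point is a hypothetical convention, not one stated in the talk.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{host_data_dir}:{mount_point}:ro",
        image_tag,
    ]
```

On a machine with Docker installed, each list could be passed to `subprocess.run(cmd, check=True)`. Mounting the leaderboard or validation data only at scoring time keeps it out of the container the participant works in.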
2016 Digital Mammography Challenge: a case study in using Docker in a DREAM challenge
• The Scientific Question: How can we reduce erroneous recall rate (false positives)?
• Image analysis machine learning problem
• “Deep learning” algorithms expected
• $1.2M in prize money expected to attract 100s of serious participants
• 600,000 mammography images donated (~20TB)
• Budget for 100s of GPU servers from two Cloud providers (AWS, IBM)
Why use Docker?
1) Large data size
2) Sensitive data
3) Provisioned compute
(1) Allocate machine (e.g. own laptop)
(2) Retrieve base image
(3) Retrieve small, pilot dataset.
(4) Create model
(5) Verify model using pilot dataset
(6) Push Dockerized model to registry
(7) Submit model to Challenge.
(8) Receive trained model and score.
Submission queue built into Synapse
(1) Retrieve new submissions.
(2) Retrieve Docker image.
(3) Train / score model.
(4) Save trained model and score.
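The server-side loop above can be sketched as follows. This is a simplified, assumed design: `Submission`, `process_submissions`, and the injected `pull_image` and `train_and_score` callables are hypothetical names, and the real queue is the one built into Synapse. Injecting the two callables keeps the sketch runnable without Docker or GPU hosts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Submission:
    submission_id: str
    image_ref: str  # reference to the participant's Docker image in a registry


def process_submissions(
    submissions: Iterable[Submission],
    pull_image: Callable[[str], None],
    train_and_score: Callable[[str], float],
) -> Dict[str, dict]:
    """Retrieve each submitted image, train and score it, and record the outcome."""
    results: Dict[str, dict] = {}
    for sub in submissions:
        try:
            pull_image(sub.image_ref)
            score = train_and_score(sub.image_ref)
            results[sub.submission_id] = {"status": "SCORED", "score": score}
        except Exception as err:
            # One failing submission must not stop the rest of the queue.
            results[sub.submission_id] = {"status": "FAILED", "error": str(err)}
    return results
```

Because participants never touch the full data set, the only things that flow back to them are the recorded status, the score, and the trained model.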
• We’ve implemented the data donor’s wish to maintain control of the data.
• We have obviated the need to download the large data set.
• We have democratized participation, making compute available to those who might not otherwise have it.
• After the challenge we have a library of rerunnable models ensuring reproducibility.
Outcome
• How best to monitor a fleet of Docker hosts (incl. GPU usage)?
• How reproducible are models run on different GPU machines? How much of the software stack should be in the container?
• How shall we limit submitted jobs?
• Are there networking issues as models access data?
• What are the security issues when running submitted containers?
Open questions
• Images aren't always portable. System Z images can't be used on Intel-based hardware.
• Reproducibility doesn't mean comprehensibility.
• Find out about all our challenges at www.synapse.org
• For those of you down in the trenches, see brucehoff/dockerauth for an example of how to do registry delegated authorization in Java.
/etc
Acknowledgements Sage Bionetworks
Stephen Friend Thea Norman Lara Mangravite Mike Kellen Mette Peters Arno Klein Solly Sieberts Abhi Pratap Chris Bare Bruce Hoff
IBM Erhan Bilal Kely Norel Elise Blaese Pablo Meyer Rojas Kahn Rrhissorrakrai
EBI Julio Saez Rodriguez Thomas Cokelaer Federica Eduati Michael Menden
L. Maximilians University Robert Kueffner
Univ Colorado, Denver Jim Costello
OHSU Joe Gray Adam Margolin Mehmet Gonen Laura Heiser
Prize4Life Melanie Leitnerr Neta Zach
NCI Dinah Singer Dan Gallahan
ISMMS Eli Stahl Gaurav Pandey
Columbia University Andrea Califano Mukesh Bansal Chuck Karan
Rice University Amina Qutub David Noren Byron Long
MD Anderson Steven Kornblau
Univ of Lausanne Daniel Marbach
Broad Institute Bill Hahn Barbara Weir Aviad Tsherniak
Merck Robert Plenge
BYU Keoni Kauwe
OICR Paul Boutros
UCSC Josh Stuart
Thank you!
• Science Translational Medicine (1 paper)
• Nature Biotechnology (4 papers)
• Nature Genetics (papers in preparation)
• Nature Methods (papers in preparation)
• Nature Neuroscience (papers in preparation)
• PLoS Computational Biology (papers in review and preparation)
• National Cancer Institute (contracts for Best Performers)
Challenge Assisted Peer Review Partners
A crowdsourcing effort that poses quantitative challenges about systems biology modeling and data analysis on:
• Transcriptional and signaling networks
• Predictions of response to perturbations
• Translational research (tox, RA, AD, ALS, AML, …)
Our mission is to contribute to the solution of important biomedical problems:
• to foster collaboration between research groups
• to democratize data
• to accelerate research
• to objectively assess algorithm performance
What are the DREAM Challenges
Peer review is subjective. But even if it were not, what comes to the reviewers may be biased:
• Bias against publication of negative results or results contrary to published results
• Incentive structures put researchers under considerable pressure to keep trying until they find a positive result (multiple testing, over-fitting, etc.)
Dani Brunner et al., Behavioral processes 89, 187-195 (2012)
Inflated Statistical Significance
Multiple Testing
Selective Reporting
Overfitting
Benefits of crowd-sourcing
• Performance Evaluation
– Unbiased, consistent, and rigorous method assessment
– Unbiased comparison and discovery of best methods
– Determine the solvability of a scientific question
• Sampling of the space of methods
– Understand the diversity of methodologies presently being used to solve a problem
Benefits of crowd-sourcing
• Acceleration of Research
– The community of participants can do in 4 months what would take any single group 10 years
• Community Building
– Make high quality, well-annotated data accessible
– Foster community collaborations on fundamental research questions
– Determine robust solutions through community consensus: “The Wisdom of Crowds”
• Disease research is data intensive. A typical researcher has a PhD in multivariate statistics and does a lot of programming in languages like R, Python, and Matlab, using libraries of established tools.
• So these analyses are software stacks of a sort, each piece having the typical series of revisions.
• This makes reproducibility really challenging: To reproduce an analysis you need not only the original data and the statistical processing script written by the author, but the correct versions of all the dependencies.
• Obviously containerization offers a powerful tool for reproducibility: the entire software stack used in an analysis can be tracked.
The challenge of reproducibility
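One small piece of the dependency-tracking problem can be sketched in Python: recording the exact versions an analysis ran against. The function name `snapshot_environment` and the choice of packages are assumptions for illustration. A container image captures far more (system libraries, the interpreter itself), so this is only a lightweight complement to containerization, not a replacement for it.

```python
import sys
from importlib import metadata


def snapshot_environment(packages):
    """Record the interpreter version and the installed version of each named package."""
    snapshot = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly rather than silently skipping it.
            snapshot[name] = None
    return snapshot
```

Saving such a snapshot next to an analysis result makes it possible to tell later which library revisions produced it.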