
Docker in Open Science Data Analysis Challenges

Bruce Hoff, Principal Software Engineer, Sage Bionetworks

Open Science in Disease Research

Containerization as a tool for scientific reproducibility

Case Study: Docker in the 2015 ALS Stratification Challenge

Case Study: Docker in the 2016 Digital Mammography Challenge

Open Issues and Lessons Learned

Agenda

This talk is about saving lives.

Disease research is data intensive…

… but published analyses often aren’t reproducible.

… and valuable data sets aren’t shared freely.

… which reduces the rate of progress.

Difficulties in science validation

• Amgen scientists tried to confirm 53 landmark papers in pre-clinical oncology research: only 6 (11%) were confirmed. [1]
• Bayer HealthCare reported that only about 25% of published preclinical studies could be validated. [2]
• Potti Gate: genomics research at Duke during 2006-2010 led to the identification of diagnostic signatures that spurred clinical trials. The research was later deemed statistically flawed and the clinical trials were stopped.

[1] C. Glenn Begley and Lee M. Ellis, Nature 483, 531 (2012)
[2] Prinz, F., Schlange, T. & Asadullah, K., Nature Rev. Drug Discov. 10, 712 (2011)

Our Solution: Open Data Analysis Challenges

Engage the community, rather than a select company or lab, to solve a problem in biological/medicinal research.

Obtain and expose a high value data set that would otherwise be accessible by a few.

Require that participants share their code and document their algorithms; test for reproducibility.

[Figure: bar chart of unique final submissions (number of submitting teams, axis from 50 to 450) for each challenge round from DREAM 2 through DREAM 9.5]

A decade of participation

Measures of Impact
• 32 scientific challenges
• 50 partner institutions (since 2006)
• >5000 registered users
• 10 international conferences
• 2500 conference attendees
• >100 publications using DREAM data
• 25 journal articles
• 3 journal special issues
• 2 edited books
• 1,300 citations
• 20 PhD theses
• Use of Challenges in the classroom as problem sets

Dialogue for Reverse Engineering Assessment and Methods (DREAM) is a crowdsourcing effort that poses quantitative challenges about systems biology modeling.

Sage Bionetworks (2009-) is a nonprofit biomedical research organization seeking to accelerate biomedical research through open systems, incentives, and standards.

The two organizations merged in 2013 to drive a continuing series of open science challenges.

The Organization

• Web services that facilitate collaborative web science
  – Projects sharing resources (code, files, ideas)
  – Wiki narratives

• Analysis provenance - linking data, code, and results; data versioning

• Web services that facilitate Challenge logistics
  – Registration, acceptance of data usage, acceptance of Challenge Terms and Conditions
  – Real-time challenge leaderboards
  – Discussion Forum
  – Formation of Teams
  – Online Supplement for Challenge Papers, e.g.:

https://www.synapse.org/#!Synapse:syn2528824/wiki/

Synapse: enabling collaborative research


2015 ALS Challenge: a case study in using Docker in a DREAM challenge

ALS is a rapidly progressing neurodegenerative disease that typically leads to death within 3-5 years, but disease progression is heterogeneous across the patient population.

Data for 9000 ALS patients provided by the Pooled Resources Open-Access Clinical Trial (PRO-ACT) database.

The challenge was to predict disease progression from clinical data.

$28,000 in prize money raised through a grass-roots fund drive:
https://www.indiegogo.com/projects/fund-the-prize-solve-als-together

Nature Biotechnology agreed to publish the results.

In a typical challenge…
• Data is partitioned into
  – training
  – leaderboard
  – validation
• Participants
  – download training data
  – apply statistical learning methods
  – submit predictions
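The three-way partition described above can be sketched in a few lines of Python. This is only an illustration: the 60/20/20 split, the fixed seed, and the function name `partition` are assumptions, not details given in the talk.

```python
import random


def partition(record_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle record IDs and split them into three disjoint sets.

    The fractions are hypothetical; the talk does not state the proportions.
    """
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is repeatable
    n_train = round(len(ids) * fractions[0])
    n_board = round(len(ids) * fractions[1])
    return {
        "training": ids[:n_train],                       # released to participants
        "leaderboard": ids[n_train:n_train + n_board],   # scored during the challenge
        "validation": ids[n_train + n_board:],           # held back for final scoring
    }


splits = partition(range(100))
print({name: len(ids) for name, ids in splits.items()})
```

Only the training set is downloaded by participants; the other two sets stay with the organizers for scoring.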

Organizers want to constrain submitted models to work in a certain way:
• Model has a ‘selector’ component to select predictive clinical features
• Model has a ‘predictor’ component to predict ALS outcome based on selected features.

Organizers want to run each model themselves to:
• Ensure models are structured as prescribed
• Ensure reproducibility of output

Docker to the rescue!

[Diagram: Clinical Data → Selector → Selected Features → Predictor → Model Output]
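The selector/predictor structure can be illustrated with a small Python sketch. The function names, the toy feature names (`age`, `fvc`, `weight`), and the toy scoring rule are all hypothetical; the point is only that the harness, not the participant, decides what the predictor gets to see.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]


def run_model(
    selector: Callable[[List[Record]], List[str]],
    predictor: Callable[[Record], float],
    records: List[Record],
) -> List[float]:
    """Run a two-stage model: select features, then predict from them only."""
    selected = selector(records)
    # Strip every record down to the selected features before prediction,
    # so the predictor cannot use anything the selector did not choose.
    reduced = [{name: rec[name] for name in selected} for rec in records]
    return [predictor(row) for row in reduced]


def toy_selector(records: List[Record]) -> List[str]:
    return ["fvc"]  # hypothetical: pick a single clinical feature


def toy_predictor(row: Record) -> float:
    assert set(row) == {"fvc"}  # only the selected feature is visible
    return row["fvc"] / 100


patients = [
    {"age": 60.0, "fvc": 80.0, "weight": 70.0},
    {"age": 50.0, "fvc": 90.0, "weight": 65.0},
]
predictions = run_model(toy_selector, toy_predictor, patients)
print(predictions)
```

Running the model inside the organizers' own harness is what lets them check that a submission really follows the prescribed structure.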

Scientific Leadership

High value data set

IT Resources

Prize Money

High visibility Publication

Community participation

The ‘Stone Soup’ of Open Challenges

IBM Cloud with a zEC12 system virtual machine running a Linux server with 32 processors, 240 GB memory and 9 TB storage space.

IBM Donates a Mainframe for ALS Challenge

Provision a container on a unique port for each participant. They log in as:
> ssh [email protected] -p port_number

Provide a script that sends a “signal” to a process running Docker:
> create_model_snapshot

Back-end process runs “docker commit” to create a copy of the model for scoring.

Back-end reruns captured image as a new container, after mounting leaderboard (or later, validation) data volume.

Using Docker with a Mainframe
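The snapshot-and-rescore steps can be sketched as follows. This is a minimal illustration, not the back-end used in the challenge: the container name, image tag, and data paths are hypothetical, and the read-only mount is an assumption. Only the standard `docker commit` and `docker run` commands are used.

```python
import subprocess
from typing import List


def commit_command(container_name: str, snapshot_tag: str) -> List[str]:
    """Build the `docker commit` call that freezes a participant's container as an image."""
    return ["docker", "commit", container_name, snapshot_tag]


def scoring_command(snapshot_tag: str, host_data_dir: str,
                    mount_point: str = "/data") -> List[str]:
    """Build the `docker run` call that reruns the snapshot with a data volume mounted.

    The volume is mounted read-only (an assumption) so the model cannot alter the data.
    """
    volume = f"{host_data_dir}:{mount_point}:ro"
    return ["docker", "run", "--rm", "-v", volume, snapshot_tag]


def snapshot_and_score(container_name: str, snapshot_tag: str, host_data_dir: str) -> None:
    """Execute both steps; requires a Docker daemon, so it is not called below."""
    subprocess.run(commit_command(container_name, snapshot_tag), check=True)
    subprocess.run(scoring_command(snapshot_tag, host_data_dir), check=True)


# Hypothetical names, printed rather than executed.
print(" ".join(commit_command("participant42", "als-snapshots:participant42")))
print(" ".join(scoring_command("als-snapshots:participant42", "/srv/leaderboard")))
```

Switching from leaderboard to validation scoring would then only change the host directory that is mounted.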

2016 Digital Mammography Challenge: a case study in using Docker in a DREAM challenge

• The Scientific Question: How can we reduce erroneous recall rate (false positives)?

• Image analysis machine learning problem
• “Deep learning” algorithms expected
• $1.2M in prize money expected to attract 100s of serious participants
• 600,000 mammography images donated (~20TB)
• Budget for 100s of GPU servers from two Cloud providers (AWS, IBM)

Why use Docker?
1) Large data size
2) Sensitive data
3) Provisioned compute

(1) Allocate machine (e.g. own laptop)

(2) Retrieve base image

(3) Retrieve small, pilot dataset.

(4) Create model

(5) Verify model using pilot dataset

(6) Push Dockerized model to registry

(7) Submit model to Challenge.

…

(8) Receive trained model and score.

Submission queue built into Synapse

(1) Retrieve new submissions.

…

(2) Retrieve Docker image.

(3) Train / score model.

(4) Save trained model and score.
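The four back-end steps above can be sketched as a simple processing loop. This is not the Synapse API: the `Submission` type and the three callables are hypothetical stand-ins for whatever retrieves images, runs training and scoring, and stores results.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Submission:
    submission_id: str
    image: str  # reference to the Docker image the participant pushed


def process_queue(
    submissions: Iterable[Submission],           # (1) new submissions from the queue
    pull_image: Callable[[str], None],           # (2) retrieve the Docker image
    train_and_score: Callable[[str], float],     # (3) train / score the model
    save_result: Callable[[str, float], None],   # (4) save trained model and score
) -> Dict[str, float]:
    """Process each submission in order and return the scores by submission ID."""
    scores: Dict[str, float] = {}
    for sub in submissions:
        pull_image(sub.image)
        score = train_and_score(sub.image)
        save_result(sub.submission_id, score)
        scores[sub.submission_id] = score
    return scores


# In-memory stand-ins so the loop can be run without Docker or a real queue.
pulled, saved = [], {}
queue = [Submission("s1", "registry.example/team-a/model"),
         Submission("s2", "registry.example/team-b/model")]
scores = process_queue(
    queue,
    pull_image=pulled.append,
    train_and_score=lambda image: float(len(image)),  # dummy score
    save_result=saved.__setitem__,
)
print(scores)
```

Because the participant never touches the full data set, every step that reads the data happens inside this organizer-controlled loop.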

• We’ve implemented the data donor’s wish to maintain control of the data.
• We have obviated the need to download the large data set.
• We have democratized participation, making compute available to those who might not otherwise have it.
• After the challenge we have a library of rerunnable models ensuring reproducibility.

Outcome

• How best to monitor a fleet of Docker hosts (incl. GPU usage)?
• How reproducible are models run on different GPU machines? How much of the software stack should be in the container?
• How shall we limit submitted jobs?
• Are there networking issues as models access data?
• What are the security issues when running submitted containers?

Open questions

• Images aren't always portable. System Z images can't be used on Intel-based hardware.
• Reproducibility doesn't mean comprehensibility
• Find out about all our challenges at www.synapse.org
• For those of you down in the trenches, see brucehoff/dockerauth for an example of how to do registry delegated authorization in Java.

/etc


Acknowledgements

Sage Bionetworks

Stephen Friend Thea Norman Lara Mangravite Mike Kellen Mette Peters Arno Klein Solly Sieberts Abhi Pratap Chris Bare Bruce Hoff

IBM Erhan Bilal Kely Norel Elise Blaese Pablo Meyer Rojas Kahn Rrhissorrakrai

EBI Julio Saez Rodriguez Thomas Cokelaer Federica Eduati Michael Menden

L. Maximilians University Robert Kueffner

Univ Colorado, Denver Jim Costello

OHSU Joe Gray Adam Margolin Mehmet Gonen Laura Heiser

Prize4Life Melanie Leitnerr Neta Zach

NCI Dinah Singer Dan Gallahan

ISMMS Eli Stahl Gaurav Pandey

Columbia University Andrea Califano Mukesh Bansal Chuck Karan

Rice University Amina Qutub David Noren Byron Long

MD Anderson Steven Kornblau

Univ of Lausanne Daniel Marbach

Broad Institute Bill Hahn Barbara Weir Aviad Tsherniak

Merck Robert Plenge

BYU Keoni Kauwe

OICR Paul Boutros

UCSC Josh Stuart

Thank you!

• Science Translational Medicine (1 paper)

• Nature Biotechnology (4 papers)

• Nature Genetics (papers in preparation)

• Nature Methods (papers in preparation)

• Nature Neuroscience (papers in preparation)

• PLoS Computational Biology (papers in review and preparation)

• National Cancer Institute (contracts for Best Performers)

Challenge Assisted Peer Review Partners

A crowdsourcing effort that poses quantitative challenges about systems biology modeling and data analysis on:

Transcriptional and signaling networks, Predictions of response to perturbations, Translational research (tox, RA, AD, ALS, AML, …)

Our mission is
• to contribute to the solution of important biomedical problems
• to foster collaboration between research groups
• to democratize data
• to accelerate research
• to objectively assess algorithm performance

What are the DREAM Challenges

Peer review is subjective. But even if it were not, what comes to the reviewers may be biased:
• Bias against publication of negative results or results contrary to published results
• Incentive structures put researchers under considerable pressure to try until they find a positive result (multiple testing, over-fitting, etc.)

Dani Brunner et al., Behavioral processes 89, 187-195 (2012)

Inflated Statistical Significance

Multiple Testing

Selective Reporting

Overfitting
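To make the multiple-testing effect concrete, here is a short worked calculation. It uses the standard family-wise error rate formula for independent tests, 1 - (1 - alpha)^m; the significance level of 0.05 and the choice of 20 tests are illustrative assumptions, not figures from the talk.

```python
def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among independent tests.

    Assumes every null hypothesis is true and the tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


for m in (1, 5, 20):
    print(m, round(family_wise_error_rate(m), 2))
# With 20 independent tests at alpha = 0.05, the chance of at least one
# spurious "positive" result is roughly 0.64.
```

A researcher who keeps trying analyses until one crosses the significance threshold is therefore quite likely to find a positive result by chance alone.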

Benefits of crowd-sourcing
• Performance Evaluation
  – Unbiased, consistent, and rigorous method assessment
  – Unbiased comparison and discovery of best methods
  – Determine the solvability of a scientific question
• Sampling of the space of methods
  – Understand the diversity of methodologies presently being used to solve a problem

Benefits of crowd-sourcing
• Acceleration of Research
  – The community of participants can do in 4 months what would take any single group 10 years
• Community Building
  – Make high quality, well-annotated data accessible
  – Foster community collaborations on fundamental research questions
  – Determine robust solutions through community consensus: “The Wisdom of Crowds”

• Disease research is data intensive. A typical researcher has a PhD in multivariate statistics and does a lot of programming in languages like R, Python, and Matlab, using libraries of established tools.

• So these analyses are software stacks of a sort, each piece having the typical series of revisions.

• This makes reproducibility really challenging: To reproduce an analysis you need not only the original data and the statistical processing script written by the author, but the correct versions of all the dependencies.

• Obviously containerization offers a powerful tool for reproducibility: the entire software stack used in an analysis can be tracked.

The challenge of reproducibility
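As a small illustration of why dependency versions matter, the sketch below records the interpreter version and the installed version of each named package, which is the kind of information a container image captures implicitly. The function name `snapshot_versions` and the example package names are assumptions; `importlib.metadata` is part of the Python standard library (3.8+).

```python
import platform
from importlib import metadata
from typing import Dict, Iterable, Optional


def snapshot_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Record the Python version and the installed version of each package.

    Packages that are not installed are recorded as None rather than raising.
    """
    versions: Dict[str, Optional[str]] = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Example package names only; any analysis would list its own dependencies.
print(snapshot_versions(["numpy", "pandas"]))
```

A manifest like this still leaves out system libraries and the operating system, which is the gap a container image closes by packaging the whole stack.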