Composable, Petabyte-Scale Genomics Workflows with Docker and Luigi
8/16/2019 Composable, Petabyte-Scale Genomics Workflows with Docker and Luigi
S L I D E 0
Composable, Petabyte-Scale
Genomics Workflows with Docker and Luigi
Wade L. Schulz, MD, PhD, Henry M. Rinder, MD, Richard Torres, MD, MS, Alexa Siddon, MD
Resident, Department of Laboratory Medicine, Yale School of Medicine
Senior Solution Architect, Helix Data Sciences, Yale-New Haven Hospital
-
S L I D E 1
Notice of Faculty Disclosure
In accordance with ACCME guidelines, any individual in a position to influence and/or control the content of this ASCP CME activity has disclosed all relevant financial relationships within the past 12 months with commercial interests that provide products and/or services related to the content of this CME activity.
The individual below has responded that he/she has no relevant financial relationship(s) with commercial interest(s) to disclose:
Wade Schulz, MD, PhD
-
S L I D E 2
Composable, Petabyte-Scale Genomics Workflows with Docker and Luigi
• Clinical question and background
• Open/Big Data
• Barriers to Large-Scale Genomics (Big Data) Analysis
• Pipeline Improvements
• Case Study with Performance Metrics
-
S L I D E 3
Clinical Question – Tumor Heterogeneity
Hypothesis: Patients with acute myeloid leukemia (AML) who present with multiple hematopoietic clones have a worse prognosis than patients with a single clone.
http://commons.wikimedia.org/wiki/File:Treatment_bottleneck.pdf - Lcchong
-
S L I D E 4
Predictions of Clonal Populations
• In vitro analysis by single-cell sequencing
• In silico prediction based on clustering of variants by variant allele frequency
Miller, C. A., White, B. S., Dees, N. D., et al. (2014). SciClone: Inferring Clonal Architecture and Tracking the Spatial and Temporal Patterns of Tumor Evolution. PLoS Computational Biology, 10(8), e1003665. doi:10.1371/journal.pcbi.1003665
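As a rough illustration of the in silico approach, the sketch below groups variants by variant allele frequency (VAF) with a plain one-dimensional k-means. This is a simplified stand-in chosen for brevity, not SciClone's method, which fits a statistical mixture model. The VAF values, the choice of two clusters, and the function names are all made up for illustration.

```python
def assign(vafs, centers):
    """Place each VAF in the group of its nearest center."""
    groups = [[] for _ in centers]
    for vaf in vafs:
        nearest = min(range(len(centers)), key=lambda i: abs(vaf - centers[i]))
        groups[nearest].append(vaf)
    return groups


def cluster_vafs(vafs, k, max_iterations=100):
    """Cluster variant allele frequencies with one-dimensional k-means."""
    if not 1 <= k <= len(vafs):
        raise ValueError("k must be between 1 and the number of variants")
    ordered = sorted(vafs)
    # Deterministic start: pick k values spread evenly through the sorted list.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(max_iterations):
        groups = assign(vafs, centers)
        # Move each center to the mean of its group; an empty group keeps its center.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers, assign(vafs, centers)


# Hypothetical VAFs for eight somatic variants in one tumor sample.
example_vafs = [0.48, 0.45, 0.51, 0.47, 0.22, 0.25, 0.20, 0.24]
centers, groups = cluster_vafs(example_vafs, k=2)
for center, group in zip(centers, groups):
    print(f"cluster center {center:.3f}: {len(group)} variants")
```

In this toy data the variants separate into one group near 0.48 and one near 0.23. Under the hypothesis on the earlier slide, more than one VAF cluster would be read as a hint of more than one clonal population; a real analysis would also need to account for copy number and tumor purity, which this sketch ignores.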
-
S L I D E 5
Impact of Molecular Heterogeneity
-
S L I D E 6
Somatic Mutations in Cancer
• How do we:
– Increase our N
– Increase the number of identified variants
Grove, C. S., & Vassiliou, G. S. (2014). Acute myeloid leukaemia: a paradigm for the clonal evolution of cancer? Disease Models & Mechanisms, 7(8), 941–951.
AML, Somatic, Whole Exome
-
S L I D E 7
Open Data
• Clinical Trials
• Primary Research Data
• Government Data
-
S L I D E 8
The Cancer Genome Atlas (TCGA)
• TCGA: comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies
• CGHub: genomics data repository containing >1.4 petabytes of data
-
S L I D E 9
Analysis Architecture
-
S L I D E 10
Barriers to High-Throughput Analysis
• Manual Workflow/Fault Tolerance
– Need to process 200 patients
– Errors require manual restart
• Bandwidth Throughput
– Gigabit internet connectivity
– More limited (100 Mbit) when throttled
• Drive Space
– ~110 GB per WGS BAM file, paired tumor/normal for each patient
• Processor/Memory Capacity
– Downstream applications lack parallelization
– Unable to run multiple instances due to software design
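To put the bandwidth and drive-space figures above in perspective, here is a back-of-the-envelope calculation. It assumes that all 200 patients have a tumor/normal pair of roughly 110 GB files, that sizes are decimal gigabytes, and that the link is fully saturated with no protocol overhead; none of these assumptions is stated on the slide, so the results are an optimistic lower bound on transfer time.

```python
GB = 10**9  # decimal gigabyte in bytes (assumption)

patients = 200             # from the slide
bams_per_patient = 2       # paired tumor/normal
bam_size_bytes = 110 * GB  # ~110 GB per WGS BAM file


def transfer_days(total_bytes, link_bits_per_second):
    """Days needed to move total_bytes over a fully used link."""
    seconds = total_bytes * 8 / link_bits_per_second
    return seconds / 86_400


total_bytes = patients * bams_per_patient * bam_size_bytes
print(f"total data:    {total_bytes / 10**12:.0f} TB")                       # 44 TB
print(f"at 1 Gbit/s:   {transfer_days(total_bytes, 10**9):.1f} days")        # 4.1 days
print(f"at 100 Mbit/s: {transfer_days(total_bytes, 10**8):.1f} days")        # 40.7 days
```

Under these assumptions the raw inputs alone come to about 44 TB, which is why files cannot simply all be downloaded up front and kept on local disk, and why a throttled link turns download into the dominant cost.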
-
S L I D E 11
Architectural Improvements
• Workflow Creation
– Oozie, Luigi
• Containerization
– Docker
-
S L I D E 12
Why (Luigi) Workflows?
• Automation
– Fault Tolerance
• Improved Resource Utilization
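The fault-tolerance point can be sketched in a few lines. Luigi describes each step as a task with `requires()`, `output()`, and `run()` methods and treats a task as complete when its output exists. The code below mimics that shape in plain Python so it runs without Luigi installed; it is not Luigi's API, and the `Download` and `CallVariants` steps, file names, and `build` helper are hypothetical.

```python
import tempfile
from pathlib import Path


class Task:
    """Minimal stand-in for a workflow task: dependencies, one output file, one action."""

    def __init__(self, workdir, sample_id):
        self.workdir = Path(workdir)
        self.sample_id = sample_id

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task counts as done when its output file exists.
        return self.output().exists()


class Download(Task):  # hypothetical step
    def output(self):
        return self.workdir / f"{self.sample_id}.bam"

    def run(self):
        self.output().write_text("placeholder BAM contents\n")


class CallVariants(Task):  # hypothetical step
    def requires(self):
        return [Download(self.workdir, self.sample_id)]

    def output(self):
        return self.workdir / f"{self.sample_id}.vcf"

    def run(self):
        bam = self.requires()[0].output()
        self.output().write_text(f"variants called from {bam.name}\n")


def build(task, log):
    """Run a task after its dependencies, skipping anything whose output exists."""
    if task.complete():
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(f"{type(task).__name__}({task.sample_id})")


with tempfile.TemporaryDirectory() as workdir:
    first_run, second_run = [], []
    build(CallVariants(workdir, "sample-1"), first_run)
    # Simulate a failure that lost only the final output, then restart.
    (Path(workdir) / "sample-1.vcf").unlink()
    build(CallVariants(workdir, "sample-1"), second_run)
    print(first_run)   # both steps ran
    print(second_run)  # only the missing step re-ran
```

Because completion is judged from outputs, a restart after an error picks up at the first missing file instead of repeating the download. A production `run()` should write to a temporary path and rename it into place, so that a crash mid-write does not leave a partial file that looks complete.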
-
S L I D E 13
Architectural Improvements
• Workflow Creation
– Oozie, Luigi
• Containerization
– Docker
-
S L I D E 14
Containerization
-
S L I D E 15
Why Containers?
• Standardization
– Software Matched to OS Version
• Isolation
– Software Validation
• Parallelization
– Processor/Memory Capacity
• Clustering
– Bandwidth Throughput
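The isolation and parallelization points above can be made concrete with a small sketch: a tool that cannot run as multiple instances on one host can still be started once per sample, each copy in its own container. The code below only builds and prints `docker run` command lines. The image name comes from the resource list at the end of the deck; the host directory, memory limit, sample IDs, file layout, and the command run inside the container are assumptions, since the slides do not say how the image is invoked.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def docker_run_argv(image, sample_id, data_dir, memory, command):
    """Build a `docker run` argument list for one sample in its own container."""
    return [
        "docker", "run", "--rm",         # remove the container when it exits
        "--name", f"job-{sample_id}",    # one uniquely named container per sample
        "--memory", memory,              # cap the container's RAM
        "-v", f"{data_dir}:/data",       # share input/output files with the host
        image,
        *command,
    ]


def run_all(argvs, max_parallel):
    """Run the containers, at most max_parallel at a time; raise if any fails."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # list() forces every result so a failed container surfaces as an exception.
        list(pool.map(lambda argv: subprocess.run(argv, check=True), argvs))


def somaticsniper_command(sample_id):
    # Hypothetical file layout and in-container command line.
    return [
        "bam-somaticsniper", "-f", "/data/reference.fa",
        f"/data/{sample_id}-tumor.bam",
        f"/data/{sample_id}-normal.bam",
        f"/data/{sample_id}.snv",
    ]


jobs = [
    docker_run_argv("molecular/somaticsniper", sample, "/mnt/genomics", "8g",
                    somaticsniper_command(sample))
    for sample in ["patient-001", "patient-002"]  # hypothetical sample IDs
]
for argv in jobs:
    print(shlex.join(argv))
# run_all(jobs, max_parallel=2)  # needs a Docker host, so it is left commented out
```

Building the argument list separately from executing it keeps the sketch testable without Docker, and the same list could be handed to a workflow task rather than to `run_all`.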
-
S L I D E 16
Docker Containers
[Figure: stack comparison of virtual servers and containers]
Virtual Servers: Infrastructure > Hypervisor > OS 1, OS 2, OS 3 > Libs (one set per OS) > App 1, App 2A, App 2B
Containers: Infrastructure > Hypervisor > OS/Docker Engine > Libs > App 1, App 2A, App 2B
-
S L I D E 17
Updated Analysis Architecture
-
S L I D E 18
Pipeline Performance Comparison
-
S L I D E 19
Virtualized Performance Characteristics
-
S L I D E 20
Pipeline Performance Comparison
[Figure: pipeline performance comparison; labeled values: 10 days and 3 days]
-
S L I D E 21
Conclusions
• Workflow automation can increase throughput
– Fewer manual steps
– Continual and immediate data processing
• Containerization can improve throughput of large, computationally-intensive data sets
– Applications that do not support parallelization
– Applications with complex or unsupported dependencies
-
S L I D E 22
Future Directions
• Can additional throughput improvements be made by clustering?
– Deployment of containers through Docker Swarm
– When deployed to our data science cluster, expected that 1 petabyte can be entirely processed in
-
Questions?
• Docker (v1.9.1, docker.com)
• Luigi (v2.0.1, github.com/spotify/luigi)
– https://hub.docker.com/r/wadeschulz/luigi
• GeneTorrent (v3.8.7, cghub.ucsc.edu)
– https://hub.docker.com/r/molecular/cghub
– https://hub.docker.com/r/molecular/cgdownload
• SomaticSniper (v1.0.5.0, gmt.genome.wustl.edu)
– https://hub.docker.com/r/molecular/somaticsniper
• http://wadeschulz.com/portfolio/api-2016