Bioinformatics Data Pipelines built by CSIRO on AWS
Cancer Genomics Data Pipelines
Lynn & Samantha Langit
CSIRO Bioinformatics / Australia
June 2017 - Oslo
3 billion data points per patient DNA sample
Up to 25% of the population could be sequenced by 2025
Two Perspectives
Bioinformatics Research
• Insight
• Reproducibility
Cloud Architecture
• Speed
• Low Cost
• Simplicity
Cloud Data Pipeline Pattern
Problem
• Define business problem
Data
• Quality
• Quantity
Candidate Technologies
• Ingest
• ETL
• Biz Analytics
• ML
• Visualization
Build MVPs
• Iterate
• Learn
Assemble Pipeline
• Validate each section
• Test at scale
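The "Assemble Pipeline" step above (validate each section before testing at scale) could be sketched roughly as follows. This is only an illustration: the stage names, the record layout, and the validation rules are all hypothetical and do not come from the talk.

```python
from typing import Callable, Iterable, Tuple

# A stage is a (name, transform, validate) triple. All names are hypothetical.
Stage = Tuple[str, Callable[[list], list], Callable[[list], bool]]

def run_pipeline(records: list, stages: Iterable[Stage]) -> list:
    """Run each stage in order and fail fast if a stage's output is invalid."""
    for name, transform, validate in stages:
        records = transform(records)
        if not validate(records):
            raise ValueError(f"stage '{name}' produced invalid output")
    return records

def ingest(lines):
    # Split non-empty "sample,count" lines into fields.
    return [line.strip().split(",") for line in lines if line.strip()]

def etl(rows):
    # Convert raw fields into typed records.
    return [{"sample": r[0], "variant_count": int(r[1])} for r in rows]

def analyze(rows):
    # Rank samples by variant count, highest first.
    return sorted(rows, key=lambda r: r["variant_count"], reverse=True)

STAGES = [
    ("ingest", ingest, lambda rows: all(len(r) == 2 for r in rows)),
    ("etl", etl, lambda rows: all(r["variant_count"] >= 0 for r in rows)),
    ("analyze", analyze, lambda rows: len(rows) > 0),
]

result = run_pipeline(["s1,12", "s2,40", "", "s3,7"], STAGES)
```

Keeping each validator next to its stage means a malformed input is reported at the section that produced it, rather than surfacing later in the analysis.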
Genomic Sequencing Results
CRISPR-Cas9 is a molecular engineering technology that lets researchers edit genomes accurately.
It…
Pattern-matching unique sequences of DNA
Huge demand for large-scale computation
Time-critical dimension to compute
NIH-approved for human health
Could revolutionize cancer treatments
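As a rough illustration of the "pattern-matching unique sequences of DNA" point above, the sketch below scans a sequence for candidate Cas9 target sites. It assumes SpCas9-style targets (a 20-nt protospacer followed by an NGG PAM) on the forward strand only; the function names are hypothetical, and this is not GT-Scan2's actual method, which is not described in the slides.

```python
import re
from collections import Counter

# Assumption: a 20-nt protospacer followed by an NGG PAM (SpCas9-style).
# The lookahead lets overlapping sites be reported.
TARGET = re.compile(r"(?=([ACGT]{20})([ACGT]GG))")

def find_targets(dna: str):
    """Return (start, protospacer, pam) for every candidate site on the forward strand."""
    dna = dna.upper()
    return [(m.start(), m.group(1), m.group(2)) for m in TARGET.finditer(dna)]

def unique_targets(dna: str):
    """Keep only candidate sites whose protospacer occurs exactly once.

    Simplification: exact matches within this one sequence only. A real
    off-target search would allow mismatches and scan a whole genome.
    """
    sites = find_targets(dna)
    counts = Counter(protospacer for _, protospacer, _ in sites)
    return [site for site in sites if counts[site[1]] == 1]
```

The demand for large-scale computation follows directly: every candidate site has to be compared against the rest of the genome, so the work grows with both genome size and the number of candidates.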
Serverless Lambda Architecture Pattern
[Diagram components: Users; API Gateway; Lambda functions 1, 2 and 3; buckets with objects; DynamoDB]
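One way to picture a single Lambda function in this pattern is the handler sketch below, written for an API Gateway proxy-style event. The payload fields (`job_id`, `sequence`), the GC-content calculation, and the in-memory `RESULTS` dict standing in for a DynamoDB table are all assumptions made for illustration; the slides do not show the actual functions.

```python
import json

# Stand-in for a DynamoDB table. A real deployment would call the AWS SDK here.
RESULTS = {}

def handler(event, context):
    """Lambda-style entry point for an API Gateway proxy event."""
    try:
        body = json.loads(event.get("body") or "{}")
        job_id = body["job_id"]
        sequence = body["sequence"].upper()
    except (KeyError, TypeError, ValueError, AttributeError):
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "job_id and sequence are required"}),
        }
    # Hypothetical per-request work: fraction of G and C bases in the sequence.
    gc = sum(base in "GC" for base in sequence) / max(len(sequence), 1)
    RESULTS[job_id] = {"length": len(sequence), "gc_fraction": round(gc, 3)}
    return {"statusCode": 200, "body": json.dumps({"job_id": job_id, **RESULTS[job_id]})}
```

Because the handler keeps no state between requests other than what it writes to the store, many copies can run in parallel, which is what makes the pattern suit bursty, time-critical scans.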
CSIRO: Commonwealth Scientific and Industrial Research Organisation
GT-Scan2 Demo
GT-Scan2
Scale Genomic Analysis
GWAS = genome-wide association studies (on sequencing data)
Analysis on large cohort data or imputed SNP array data
Clustering on genomic profiles to stratify large-cohort genomic data
Viewing datasets with millions of features
Cloud Data Pipeline Pattern
Problem
• Define business problem
Data
• Quality
• Quantity
Candidate Technologies
• Ingest
• ETL
• Biz Analytics
• ML
• Visualization
Build MVPs
• Iterate
• Learn
Assemble Pipeline
• Validate each section
• Test at scale
Genomics (ML) Pipeline Pattern
What is CSIRO’s solution?
• For scale at reasonable cost: use Apache Hadoop
• For scale at speed: use Apache Spark for Hadoop
• For usability in bioinformatics: create a domain-specific API (OSS library)
• For global use: leverage Cloud Pipeline Patterns
GWAS Analysis with Variant-Spark
[Diagram: Genomics Analysts; on-premises Hadoop cluster with Apache Spark in a corporate data center]
What is Apache Spark?
What is variant-spark?
Demo
80% faster than ADAM
90% faster than R
90% faster than Python
VariantSpark
Uses Apache Spark to massively parallelize the generation of random forests to identify disease genes efficiently
Analyzes 3,000 samples with 80 million features in < 30 minutes
Enables real-time diagnosis by finding similar patients
Contributes to motor neuron disease (ALS) research in Australia
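To make the random-forest idea above more concrete, here is a deliberately tiny pure-Python sketch. It is not VariantSpark's implementation: it runs on one machine, each "tree" is a single-split stump, and the synthetic genotype data (coded 0, 1, 2) and the `CAUSAL` feature index are invented for the example. What it does show is the general mechanism: many trees, each built on a bootstrap sample and a random subset of features, with the impurity decrease summed per feature to rank features.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_stump(rows, labels, features):
    """Return (feature, impurity decrease) of the best single split among `features`."""
    parent = gini(labels)
    best = (None, 0.0)
    for f in features:
        for threshold in (1, 2):  # genotypes are coded 0, 1, 2
            left = [y for x, y in zip(rows, labels) if x[f] < threshold]
            right = [y for x, y in zip(rows, labels) if x[f] >= threshold]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if parent - weighted > best[1]:
                best = (f, parent - weighted)
    return best

def forest_importance(rows, labels, n_trees=60, mtry=10, seed=0):
    """Sum the impurity decrease per feature over many bootstrapped stumps."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    importance = [0.0] * p
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample of rows
        feats = rng.sample(range(p), mtry)           # random feature subset
        f, gain = best_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        if f is not None:
            importance[f] += gain
    return importance

# Synthetic cohort: 80 samples, 30 variants, one of which determines the label.
data_rng = random.Random(42)
CAUSAL = 7  # hypothetical index of the disease-associated variant
rows = [[data_rng.randrange(3) for _ in range(30)] for _ in range(80)]
labels = [1 if r[CAUSAL] >= 1 else 0 for r in rows]
importance = forest_importance(rows, labels)
top = max(range(len(importance)), key=importance.__getitem__)
```

Each tree in the loop is independent of the others, which is the property that lets a framework such as Apache Spark spread the forest across a cluster.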
Data Prep
Statistics
Probabilistic Algorithms
Data Viz
Machine Learning…
Spark ML Classification Algorithms
Wide Random Forest Ensemble of Decision Trees
Logistic Regression
[Diagram labels: variant-spark; other libraries]
OSS Library variant-spark for all
Is it…
• usable?
• performant?
• extendable? (clean code)
• using the best language (Scala)?
• using the ‘best version’ of Spark?
• using a version of wide random forests that is understandable?
How best to Deploy Cloud Hadoop?
• IaaS: EC2 instances with Apache Hadoop, Apache Spark, more…
• PaaS: Elastic MapReduce (EMR) Hadoop cluster
• SaaS: vendor-managed, e.g. Databricks w/Jupyter Notebooks
What is Databricks?
DEMO: Jupyter Notebooks
Variant-Spark and Databricks Demo
Solving Important Questions… Cancer Genomics?
DEMO: Who is a Hipster?
AWS EC2 Spot Instances
GWAS Analysis with Variant-Spark
[Diagram: Genomics Analysts; EC2 Hadoop cluster with Apache Spark in an Availability Zone, made up of EC2 Hadoop instances and Spot EC2 Hadoop worker instances; 1000 Genomes GWAS input]
Cloud Data Pipeline Pattern
Problem, Data, Candidate Technologies, Build MVPs, Assemble Pipeline
Analyze GWAS -> S3/Hadoop (SaaS)
• Ingest: S3 -> Databricks DBFS
• ETL: Apache Spark
• Analyze: Variant-Spark ML
• Viz: Notebook SQL, R or Python
Cloud Data Pipeline Pattern
Problem, Data, Candidate Technologies, Build MVPs, Assemble Pipeline
1. Scan VCF -> S3/DynamoDB (Serverless)
• Ingest: S3
• ETL: Lambda
• Analyze: Lambda
• Viz: Lambda/API Gateway
2. Analyze GWAS -> S3/Hadoop (SaaS)
• Ingest: S3 -> Databricks DBFS
• ETL: Apache Spark
• Analyze: Variant-Spark ML
• Viz: Notebook SQL, R or Python
Modern Big Data Pipelines
• Problem #1 - Scan
• Solution: Serverless Cloud Pipeline
• Problem #2 - Analyze
• Solution: SaaS Cloud ML Pipeline
Cancer Genomics Data Pipelines
Lynn & Samantha Langit
CSIRO Bioinformatics & variant-spark
June 2017 - Oslo