Compute Cluster Deployment Guide
Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure
Genome (FASTQ and VCF) Simulation & Applications
Hariprasad Radhakrishnan
AstraZeneca, Technology Labs, UK
Genome Simulation
AstraZeneca
We are a global, science-led
biopharmaceutical business
pushing the boundaries of science
to deliver life-changing medicines.
61,500 employees worldwide
$23bn 2016 revenue*
100+ countries
Our Experiment:
• Quick introduction to DNA and DNA sequencing
• Synthetically generated genome data (FASTQ & VCF)
• How to scale and run distributed compute using Kubernetes and
Docker
Hari works as an Associate Architect - Data &
Analytics in the UK Tech Incubation Lab of
AstraZeneca.
Introduction
DNA & How it is Sequenced
AGCCCCTCAGGAGTCCGGCCACATGGAAACTCCTCATTCCGGAGGTCAGTCAGATTTACCCTGGCTCACCTTGGCGTCGCG
TCCGGCGGCAAACTAAGAACACGTCGTCTAAATGACTTCTTAAAGTAGAATAGCGTGTTCTCTCCTTCCAGCCTCCGAAAA
ACTCGGACCAAAGATCAGGCTTGTCCGTTCTTCGCTAGTGATGAGACTGCGCCTCTGTTCGTACAACCAATTTAGGTGAGT
TCAAACTTCAGGGTCCAGAGGCTGATAATCTACTTACCCAAACATAG
Deoxyribonucleic acid (DNA) is the chemical
inside the nucleus of all cells that carries the
genetic instructions for making living
organisms. A DNA molecule consists of two
strands that wrap around each other to resemble
a twisted ladder.
Human DNA
In a perfect world (just your 3 billion letters):
~700 megabytes
In the real world, right off the genome sequencer:
100 to 200 gigabytes
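The "~700 megabytes" figure can be sanity-checked with a little arithmetic. The sketch below assumes each base (A, C, G, T) is packed into 2 bits, which is an assumption of this example and not something the slides state; real sequencer output is far larger because it carries redundant reads and per-base quality scores.

```python
# Back-of-the-envelope check of the "perfect world" genome size.
# Assumption (not from the slides): each base is stored in 2 bits.
BASES = 3_000_000_000
BITS_PER_BASE = 2


def packed_size_bytes(n_bases: int, bits_per_base: int = BITS_PER_BASE) -> int:
    """Bytes needed to store n_bases at a fixed number of bits per base."""
    total_bits = n_bases * bits_per_base
    return (total_bits + 7) // 8  # round up to whole bytes


size = packed_size_bytes(BASES)
print(f"{size / 1e6:.0f} MB")     # decimal megabytes: 750 MB
print(f"{size / 2**20:.0f} MiB")  # binary mebibytes: 715 MiB
```

Either unit lands close to the rounded figure quoted on the slide.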
Human Genome Sequencing
Simulation and the need for IT
Human Genome – Alignment – Variant Calling
• Be able to simulate high-throughput genome sequencing.
• The generated genomes can be used to test existing pipelines and
infrastructure.
• Run in distributed mode, the simulation can generate enough data to
create near-production scenarios; the data can be used to test
ingestion and processing through the pipeline and the subsequent
analytic tools.
• The data is synthetic, so there are no issues with privacy, patient
de-identification, or transfer across regions.
Why Simulate Genomic Data
Our Little Experiment
• VarSim was picked as the tool for genome simulation. It provides
ways to introduce variations in a random fashion into the genome
simulation, so the output FASTQ and VCF files are unique
from each other.
• Other tools were considered, but they were not maintained or did not
provide sufficient flexibility.
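To illustrate why a different seed yields a different output, here is a toy sketch that introduces random single-base substitutions into a short reference string. It is not VarSim's method (VarSim uses a much richer, realistic mutation model); the function name `mutate` and its parameters are made up for this example.

```python
import random

BASES = "ACGT"


def mutate(reference: str, n_snvs: int, seed: int) -> str:
    """Return a copy of reference with n_snvs single-base substitutions.

    Toy illustration only; this is not how VarSim models variation.
    """
    rng = random.Random(seed)  # per-run generator, so each seed is reproducible
    seq = list(reference)
    for pos in rng.sample(range(len(seq)), n_snvs):
        # Pick a base different from the current one so every change is real.
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)


# First 40 bases of the sequence shown on the DNA slide.
reference = "AGCCCCTCAGGAGTCCGGCCACATGGAAACTCCTCATTCC"
print(mutate(reference, n_snvs=3, seed=1))
print(mutate(reference, n_snvs=3, seed=2))
```

The same seed always reproduces the same genome, which is useful when a run has to be repeated.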
Tool Selection
References
John C. Mu, Marghoob Mohiyuddin, Jian Li, Narges Bani Asadi, Mark
B. Gerstein, Alexej Abyzov, Wing H. Wong, and Hugo Y.K. Lam.
VarSim: A high-fidelity simulation and validation framework for
high-throughput genome sequencing with cancer applications.
Bioinformatics, first published online December 17, 2014.
doi:10.1093/bioinformatics/btu828
Summary:
VarSim is a framework for assessing alignment and
variant calling accuracy in high-throughput genome
sequencing through simulation or real data. In contrast
to simulating a random mutation spectrum, it
synthesizes diploid genomes with germline and somatic
mutations based on a realistic model. This model
leverages information such as previously reported
mutations to make the synthetic genomes biologically
relevant.
VarSim
• VarSim is a Python/Java based tool that simulates one genome per run.
• We looked into ways in which we could parallelize the runs to generate more
genomes.
• Built a Docker container/image for the tool.
• The experiment was run on Google Cloud Container Engine.
• Parameters like coverage, unique ID and seed value were externalized in a
Lambda-style function that the Docker images could talk to and receive arguments
from before execution.
• Output FASTQ files and VCFs would then be stored in Cloud Storage.
• Ability to choose to generate FASTQ & VCF or just VCF.
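The parameter hand-off described in the list above can be sketched roughly as follows. Everything here is an assumption made for illustration: the JSON field names (`unique_id`, `seed`, `coverage`, `fastq`), the endpoint URL passed to `fetch_params`, and the output naming scheme are hypothetical, and the slides do not show the real start script.

```python
import json
import urllib.request
from dataclasses import dataclass


@dataclass(frozen=True)
class RunParams:
    unique_id: str
    seed: int
    coverage: int
    fastq: bool  # False means "generate the VCF only"


def parse_params(payload: str) -> RunParams:
    """Validate the JSON document returned by the (hypothetical) parameter service."""
    data = json.loads(payload)
    params = RunParams(
        unique_id=str(data["unique_id"]),
        seed=int(data["seed"]),
        coverage=int(data["coverage"]),
        fastq=bool(data.get("fastq", True)),
    )
    if params.coverage <= 0:
        raise ValueError("coverage must be positive")
    return params


def fetch_params(url: str) -> RunParams:
    """HTTP GET against the parameter endpoint; the URL is supplied by the caller."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_params(resp.read().decode("utf-8"))


def output_names(params: RunParams) -> list[str]:
    """Object names to upload after the run (naming scheme is illustrative)."""
    names = [f"{params.unique_id}.vcf"]
    if params.fastq:
        names.append(f"{params.unique_id}.fastq")
    return names


example = '{"unique_id": "sim-0001", "seed": 42, "coverage": 50, "fastq": false}'
print(output_names(parse_params(example)))
```

Keeping the parameters outside the image means the same 4.8 GB image can serve every run; only the small JSON response changes between containers.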
Technicalities
Docker image (4.8 GB) contents: Start Script, Google Cloud Libraries, Java Libraries, SAMtools, VarSim, Python Libraries, Ref Genome / Insert Sequences / Annotations, ART Simulator
Diagram labels: Cloud Functions (beta), HTTP, Docker image, output files (FASTQ, VCF)
Docker Image
Using the Docker registry on Google Cloud.
Given the size of the Docker image, it made sense to take advantage of the
high network speeds between servers in the cloud for quick deployment to the
Container Engine.
Container Registry
Docker image: 4.8 GB
Using Container Registry on Google Cloud.
Container Clusters
The cluster can reach a maximum size of 1000.
Configure the version of the Docker image to be deployed to the cluster.
Kubernetes takes care of distributing the Docker image to all the instances
in the cluster.
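One way to express "run this image many times across the cluster" is a Kubernetes Job with `completions` and `parallelism` set. The slides do not say which Kubernetes resource was used, so treat this as a sketch under that assumption; the image path, job name and container name below are placeholders.

```python
import json


def job_manifest(image: str, genomes: int, parallelism: int) -> dict:
    """Build a batch/v1 Job that runs the image once per genome."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "genome-sim"},  # placeholder name
        "spec": {
            "completions": genomes,      # total successful runs wanted
            "parallelism": parallelism,  # pods running at the same time
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "varsim", "image": image}],
                }
            },
        },
    }


# Hypothetical image path; pin an explicit tag so every node pulls the same version.
manifest = job_manifest(
    "gcr.io/example-project/varsim:v1", genomes=1000, parallelism=100
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since Kubernetes accepts manifests in JSON as well as YAML.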
Kubernetes - Container Clusters
Architecture
• We have generated around 1000 unique VCF files so far.
• We have around 10 genomes with both FASTQ and VCF files.
• More FASTQ and VCF files if we can fund it.
Cost
• It cost $1000 to generate 1000 unique VCF files, i.e. $1 per VCF.
• Costs to generate FASTQ files vary based on the coverage required. For
50X coverage, the cost works out to around $5 for the FASTQ and VCF. The
costs can be brought down by generating FASTQ files in multiple lanes.
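The per-file figures above lend themselves to a quick budget estimate. The sketch below takes the two unit costs from the slide ($1 per VCF-only run, about $5 per 50X FASTQ and VCF run) and treats them as fixed, which is a simplification; the run counts in the usage lines are hypothetical planning inputs, not reported spending.

```python
# Unit costs quoted on the slide (USD); treated as constants for planning.
VCF_ONLY_USD = 1.0
FASTQ_AND_VCF_50X_USD = 5.0


def estimate_cost(vcf_only_runs: int, fastq_50x_runs: int) -> float:
    """Rough budget for a mix of VCF-only and 50X FASTQ+VCF runs."""
    return vcf_only_runs * VCF_ONLY_USD + fastq_50x_runs * FASTQ_AND_VCF_50X_USD


print(estimate_cost(1000, 0))   # 1000.0
print(estimate_cost(1000, 10))  # 1050.0
```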
Outcome
John C. Mu, Marghoob Mohiyuddin, Jian Li, Narges Bani Asadi, Mark B. Gerstein, Alexej Abyzov, Wing H.
Wong, and Hugo Y.K. Lam.
VarSim: A high-fidelity simulation and validation framework for high-throughput genome sequencing
with cancer applications.
Bioinformatics, first published online December 17, 2014. doi:10.1093/bioinformatics/btu828
To the folks from Google: Daniel Bergqvist, Nico Gaviola & Craig Box.
To Nick Brown, Rob Hernandez, Sandra Giuliani and Frank Lombardi from AstraZeneca, and Mathew Woodwark
from MedImmune, for sponsoring this work.
References
Thanks
Thank You