Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

Genome (FASTQ and VCF) Simulation & Applications Hariprasad Radhakrishnan AstraZeneca, Technology Labs, UK

Transcript of Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

Page 1: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

Genome (FASTQ and VCF) Simulation & Applications

Hariprasad Radhakrishnan, AstraZeneca, Technology Labs, UK

Page 2: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

Genome Simulation

AstraZeneca

We are a global, science-led biopharmaceutical business pushing the boundaries of science to deliver life-changing medicines.

61,500 employees worldwide

$23bn 2016 revenue*

100+ countries

Page 3: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

Our Experiment

• Quick introduction to DNA and DNA sequencing

• Synthetically generated genome data (FASTQ & VCF)

• How to scale and run distributed compute using Kubernetes and Docker

Hari works as an Associate Architect - Data & Analytics in the UK Tech Incubation Lab of AstraZeneca.

Introduction

Page 4: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

DNA & How it is Sequenced

Page 5: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

AGCCCCTCAGGAGTCCGGCCACATGGAAACTCCTCATTCCGGAGGTCAGTCAGATTTACCCTGGCTCACCTTGGCGTCGCG

TCCGGCGGCAAACTAAGAACACGTCGTCTAAATGACTTCTTAAAGTAGAATAGCGTGTTCTCTCCTTCCAGCCTCCGAAAA

ACTCGGACCAAAGATCAGGCTTGTCCGTTCTTCGCTAGTGATGAGACTGCGCCTCTGTTCGTACAACCAATTTAGGTGAGT

TCAAACTTCAGGGTCCAGAGGCTGATAATCTACTTACCCAAACATAG

Deoxyribonucleic acid (DNA) is the chemical inside the nucleus of all cells that carries the genetic instructions for making living organisms. A DNA molecule consists of two strands that wrap around each other to resemble a twisted ladder.

Human DNA

Page 6: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

In a perfect world (just your 3 billion letters):

~700 megabytes

In the real world, right off the genome sequencer:

100 to 200 gigabytes
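The two figures on this slide can be sanity-checked with a little arithmetic. The sketch below assumes 2 bits per base for the "perfect world" case. For the raw sequencer output it assumes one base character plus one quality character per sequenced base at 30x coverage. Both the encoding and the 30x figure are illustrative assumptions, not numbers from the talk.

```python
GENOME_LENGTH = 3_000_000_000  # ~3 billion letters, as on the slide


def packed_size_bytes(n_bases: int = GENOME_LENGTH) -> int:
    """Size if every base (A, C, G or T) is stored in 2 bits."""
    return n_bases * 2 // 8


def raw_read_size_bytes(coverage: int, n_bases: int = GENOME_LENGTH) -> int:
    """Rough uncompressed size of the reads.

    Assumes 1 byte for the base and 1 byte for its quality score per
    sequenced base; read headers and separators are ignored.
    """
    return n_bases * coverage * 2


if __name__ == "__main__":
    print(f"packed genome: {packed_size_bytes() / 2**20:.0f} MiB")
    print(f"raw reads at 30x: {raw_read_size_bytes(30) / 10**9:.0f} GB")
```

Under these assumptions the packed genome comes to about 715 MiB and the raw reads to about 180 GB, in line with the orders of magnitude quoted on the slide.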

Human Genome Sequencing

Page 7: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

Simulation and the need for IT

Page 8: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

Human Genome – Alignment – Variant Calling

Page 9: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

• Be able to simulate high-throughput genome sequencing.

• The generated genomes can be used to test existing pipelines and infrastructure.

• Run in distributed mode, the simulation can generate enough data to create near-production scenarios; that data can be used to test ingestion and processing through the pipeline and the subsequent analytic tools.

• The data is synthetic, so there are no issues with privacy, patient de-identification, or transfer across regions.

Why Simulate Genomic Data

Page 10: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

Our Little Experiment

Page 11: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

• VarSim was picked as the tool for genome simulation. It provides ways to introduce variations into the simulated genome in a random fashion, so that the output FASTQ and VCF files are unique from each other.

• Other tools were considered, but they were either not maintained or did not provide sufficient flexibility.
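To illustrate why seeded random variation yields outputs that are unique from each other, here is a toy sketch. It is not VarSim's model (VarSim draws on previously reported mutations and handles many variant types). It only substitutes single bases at random positions, and the function name `introduce_snvs` is made up for this example.

```python
import random

BASES = "ACGT"


def introduce_snvs(reference: str, n_variants: int, seed: int):
    """Return (mutated_sequence, variants) for a given seed.

    Each variant is (1-based position, reference base, alternate base),
    loosely mirroring the POS/REF/ALT columns of a VCF record.
    """
    rng = random.Random(seed)  # same seed -> same genome, new seed -> new genome
    positions = sorted(rng.sample(range(len(reference)), n_variants))
    sequence = list(reference)
    variants = []
    for pos in positions:
        ref_base = sequence[pos]
        alt_base = rng.choice([b for b in BASES if b != ref_base])
        sequence[pos] = alt_base
        variants.append((pos + 1, ref_base, alt_base))
    return "".join(sequence), variants


if __name__ == "__main__":
    reference = "AGCCCCTCAGGAGTCCGGCC"
    for seed in (1, 2):
        mutated, variants = introduce_snvs(reference, 3, seed)
        print(seed, mutated, variants)
```

Giving every run its own seed is what lets many parallel runs produce distinct, yet reproducible, genomes.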

Tool Selection

Page 12: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

References

John C. Mu, Marghoob Mohiyuddin, Jian Li, Narges Bani Asadi, Mark B. Gerstein, Alexej Abyzov, Wing H. Wong, and Hugo Y.K. Lam. VarSim: A high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications. Bioinformatics, first published online December 17, 2014. doi:10.1093/bioinformatics/btu828

Summary:

VarSim is a framework for assessing alignment and variant calling accuracy in high-throughput genome sequencing through simulation or real data. In contrast to simulating a random mutation spectrum, it synthesizes diploid genomes with germline and somatic mutations based on a realistic model. This model leverages information such as previously reported mutations to make the synthetic genomes biologically relevant.

VarSim

Page 13: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

• VarSim is a Python/Java-based tool that simulates one genome per run.

• We looked into ways in which we could parallelize the tool to generate more genomes.

• Built a Docker container image for the tool.

• The experiment was run on Google Cloud Container Engine.

• Parameters such as coverage, unique ID and seed value were externalized in a Lambda function that the Docker containers could call to receive their arguments before execution.

• Output FASTQ and VCF files were then stored in Cloud Storage.

• Each run can be set to generate FASTQ and VCF, or just VCF.
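The parameter hand-off described above could look roughly like the sketch below: a start script asks an HTTP endpoint for its run parameters before launching the simulation. The JSON field names (`id`, `seed`, `coverage`, `fastq`), the `RunParams` type and the idea of passing a URL are all assumptions made for illustration. The talk does not show the actual payload or script.

```python
import json
import urllib.request
from dataclasses import dataclass


@dataclass(frozen=True)
class RunParams:
    run_id: str    # unique ID for this simulated genome
    seed: int      # seed value, so every run produces a different genome
    coverage: int  # sequencing coverage, e.g. 50 for 50X
    fastq: bool    # True: FASTQ and VCF, False: VCF only


def parse_params(payload: str) -> RunParams:
    """Parse the (assumed) JSON document returned by the parameter function."""
    data = json.loads(payload)
    return RunParams(
        run_id=str(data["id"]),
        seed=int(data["seed"]),
        coverage=int(data["coverage"]),
        fastq=bool(data.get("fastq", True)),
    )


def fetch_params(url: str, timeout: float = 30.0) -> RunParams:
    """Request parameters over HTTP; the URL is supplied by the caller."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_params(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Offline example using a hand-written payload instead of a live endpoint.
    print(parse_params('{"id": "genome-0001", "seed": 42, "coverage": 50}'))
```

Keeping the parameters outside the image means the same image can be deployed unchanged to every node, with each container differing only in the arguments it receives at start-up.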

Technicalities

Page 14: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

[Diagram: Docker image, 4.8 GB]

Image contents: Start Script, Google Cloud Libraries, Java Libraries, SAM Tools, VarSim, Python Libraries, ART Simulator, Ref Genome / Insert Sequences / Annotations

Parameters: Cloud Functions (beta), over HTTP

Output files: FASTQ, VCF

Docker Image

Page 15: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

Using Docker Registry on Google Cloud.

Given the size of the Docker image, it made sense to take advantage of the high network speeds between servers on the cloud for quick deployment to the Container Engine.

Container Registry

Docker image: 4.8 GB

Page 16: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

Using Container Registry on Google Cloud.

Container Clusters

Can reach a maximum cluster size of 1000

Page 17: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

Configure the version of the Docker image to be deployed to the cluster. Kubernetes takes care of distributing the Docker image to all the instances in the cluster.
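One way to express "run this image version many times across the cluster" is a Kubernetes Job with `completions` and `parallelism` set. The talk does not say which Kubernetes resource type was used, so treat this as an assumption. The image name and job name below are placeholders.

```python
import json


def build_job_manifest(image: str, completions: int, parallelism: int) -> dict:
    """Build a Kubernetes batch/v1 Job manifest as a plain dictionary.

    completions: total number of simulation runs wanted.
    parallelism: how many pods may run at the same time.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "genome-simulation"},  # placeholder name
        "spec": {
            "completions": completions,
            "parallelism": parallelism,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        # The image tag pins the version deployed to the cluster.
                        {"name": "simulator", "image": image}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = build_job_manifest(
        image="gcr.io/example-project/genome-sim:1.0",  # hypothetical image
        completions=1000,
        parallelism=100,
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`. Changing the image tag is all that is needed to roll out a new version of the tool.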

Kubernetes - Container Clusters

Page 18: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

Architecture

Page 19: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

• We have around 1000 unique VCF files generated so far.

• We have around 10 genomes with both FASTQ and VCF files.

• More FASTQ and VCF files if we can fund it.

Cost

• It cost $1000 to generate 1000 unique VCF files: $1 per VCF.

• The cost to generate FASTQ files varies with the coverage required. For 50X coverage, the cost works out to around $5 for the FASTQ and VCF. The costs can be brought down by generating FASTQ files in multiple lanes.
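The per-file figures above make it easy to budget a larger run. The sketch below simply multiplies them out. It assumes cost scales linearly with the number of genomes and uses only the two price points quoted on the slide ($1 per VCF-only genome, about $5 per genome with FASTQ and VCF at 50X). Other coverages and the multi-lane saving are not modelled.

```python
COST_PER_VCF_USD = 1.0             # from the slide: $1000 for 1000 VCF files
COST_PER_FASTQ_VCF_50X_USD = 5.0   # from the slide: around $5 at 50X coverage


def estimate_cost_usd(vcf_only: int, fastq_and_vcf_50x: int) -> float:
    """Linear cost estimate for a mix of VCF-only and FASTQ+VCF genomes."""
    return (vcf_only * COST_PER_VCF_USD
            + fastq_and_vcf_50x * COST_PER_FASTQ_VCF_50X_USD)


if __name__ == "__main__":
    # The outcome reported in the talk: ~1000 VCF-only and ~10 FASTQ+VCF genomes.
    print(estimate_cost_usd(vcf_only=1000, fastq_and_vcf_50x=10))
```

For the outcome reported here (about 1000 VCF-only genomes and about 10 with FASTQ), the estimate is $1050, assuming the 10 FASTQ genomes were generated at 50X.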

Outcome

Page 20: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

References

John C. Mu, Marghoob Mohiyuddin, Jian Li, Narges Bani Asadi, Mark B. Gerstein, Alexej Abyzov, Wing H. Wong, and Hugo Y.K. Lam. VarSim: A high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications. Bioinformatics, first published online December 17, 2014. doi:10.1093/bioinformatics/btu828

Thanks

To folks from Google: Daniel Bergqvist, Nico Gaviola & Craig Box.

To Nick Brown, Rob Hernandez, Sandra Giuliani, Frank Lombardi from AstraZeneca and Mathew Woodwark from MedImmune for sponsoring this work.

Page 21: Genome Simulation & Applications: Use of Managed Distributed Compute Infrastructure

Thank You
