Lightning fast genomics with Spark, Adam and Scala

description

We are at a time when biotechnology lets us sequence a personal genome for about $1000. DNA sequencing has made tremendous progress since the 1970s: more samples per experiment and more genomic coverage, at higher speed. However, the genomic analysis standards developed over the years were not designed with scalability and adaptability in mind. In this talk, we’ll present a game-changing technology in this area, ADAM, initiated by the AMPLab at Berkeley. ADAM is a framework based on Apache Spark and the Parquet storage format. We’ll see how it can speed up a sequence reconstruction by a factor of 150.

Transcript

Page 1

Lightning fast genomics
With Spark and ADAM

Page 2

Who are we?

Andy

@Noootsab
@NextLab_be
@Wajug co-driver
@Devoxx4Kids organizer
Maths & CS
Data lover: geo, open, massive
Fool

Xavier

@xtordoir
SilicoCloud
-> Physics
-> Data analysis
-> genomics
-> scalable systems
-> ...

Page 3

Genomics

What is genomics about?

Medical Diagnostics

Drug response

Disease mechanisms

Page 4

Genomics

What is genomics about?

- A human genome is a sequence of 3 billion nucleotides (“bases”)

- About 1 base in 1,000 varies across the human population

- Genomes encode bio-molecules (tens of thousands)

- These molecules interact with each other

... and with the environment

→ Biological systems are very complex

Page 5

Genomics

State of the art

- growing technological capacity

- cost reduction

- growing data

Page 6

Genomics

State of the art

- I.T. becomes the bottleneck (cost and latency)

- data is sacrificed through sampling or cut-offs

Andrea Sboner et al.

Page 7

Genomics

Blocking points

- “legacy stack” not designed to scale (C, Perl, …)

- HPC approach is not a good fit (the workload is data intensive)

Page 8

Genomics

Future of genomics

- Personal genomes (e.g. 1,000,000 genomes for cancer research)

- New sequencing technologies

- Sequence “stuff” as needed (e.g. microbiome, diagnostics)

- medicalCondition = f(genomics, environmentHistory)

Page 9

Genomics

Need for scalability → Scala & Spark

Need for simplicity and clarity → ADAM

Page 10

Parquet 101

Columnar storage

Row oriented

Column oriented

Page 11

Parquet 101

Columnar storage

> Homogeneous collocated data

> Better range access

> Better encoding
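To make the row-oriented versus column-oriented contrast concrete, here is a minimal plain-Scala sketch. It does not use Parquet itself: the `Call` record, its fields and the sample values are hypothetical, and run-length encoding stands in as one example of an encoding that benefits from homogeneous, collocated values.

```scala
object ColumnarSketch {
  // Hypothetical record type, used only for illustration.
  final case class Call(sample: String, chromosome: String, genotype: Int)

  // Row-oriented layout: each record keeps all of its fields together.
  val rows: Vector[Call] = Vector(
    Call("S1", "6", 0),
    Call("S2", "6", 0),
    Call("S3", "6", 1),
    Call("S4", "6", 1)
  )

  // Column-oriented layout: one homogeneous vector per field.
  final case class Columns(
      sample: Vector[String],
      chromosome: Vector[String],
      genotype: Vector[Int]
  )

  def toColumns(rs: Vector[Call]): Columns =
    Columns(rs.map(_.sample), rs.map(_.chromosome), rs.map(_.genotype))

  // Run-length encoding: collapse consecutive repeats into (value, count) pairs.
  def runLength[A](column: Vector[A]): Vector[(A, Int)] =
    column.foldLeft(Vector.empty[(A, Int)]) { (acc, value) =>
      acc.lastOption match {
        case Some((previous, count)) if previous == value =>
          acc.init :+ ((previous, count + 1))
        case _ =>
          acc :+ ((value, 1))
      }
    }
}
```

With this layout, `ColumnarSketch.runLength(ColumnarSketch.toColumns(ColumnarSketch.rows).chromosome)` collapses the four identical chromosome values into a single pair, and a query that only needs genotypes never touches the other two columns.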

Page 12

Parquet 101

Efficient encoding of nested typed structures

message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;
    }
    optional string Url;
  }
}

Page 13

Parquet 101

Efficient encoding of nested typed structures

message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;
    }
    optional string Url;
  }
}

Nested structure → Tree

Empty levels → Branch pruning

Repetitions → Metadata (index)

Types → Safe/Fast codec

Page 14

Parquet 101

Efficient encoding of nested typed structures

ref: https://blog.twitter.com/2013/dremel-made-simple-with-parquet

Page 15

Parquet 101

Optimized distributed storage (e.g. in HDFS)

ref: http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/

Page 16

Parquet 101

Efficient (schema based) serialization: AVRO

JSON Schema

{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}

IDL

record User {
  string name;
  union { null, int } favorite_number = null;
  union { null, string } favorite_color = null;
}

Page 17

Parquet 101

Efficient (schema based) serialization: AVRO

JSON Schema

{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}

Part of the:

● protocol

● serialization

→ less metadata

Define: IDL → JSON

Send: Binary → JSON

Page 18

ADAM

Credits: AmpLab (UC Berkeley)

Page 19

ADAM

Overview (Sequencing)

- DNA is a molecule

… or a Seq[Char] over the alphabet (A, T, G, C)
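Taking the `Seq[Char]` view literally, a few basic operations fit in a handful of lines. This is only an illustrative sketch in plain Scala: the object and function names are made up for the example and are not part of ADAM.

```scala
object DnaSketch {
  val alphabet: Set[Char] = Set('A', 'T', 'G', 'C')

  // A sequence is valid when every symbol belongs to the alphabet.
  def isValid(dna: Seq[Char]): Boolean = dna.forall(alphabet.contains)

  // Count how often each base occurs.
  def baseCounts(dna: Seq[Char]): Map[Char, Int] =
    dna.groupBy(identity).map { case (base, occurrences) => (base, occurrences.size) }

  private val complement: Map[Char, Char] =
    Map('A' -> 'T', 'T' -> 'A', 'G' -> 'C', 'C' -> 'G')

  // Reverse complement: the sequence as read from the opposite strand.
  // Assumes the input is valid; an unknown symbol would throw.
  def reverseComplement(dna: Seq[Char]): Seq[Char] =
    dna.reverse.map(complement)
}
```

A plain `String` can be passed wherever a `Seq[Char]` is expected, so `DnaSketch.reverseComplement("GATTACA").mkString` works as written.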

Page 20

ADAM

Sequencing

- Massively parallel sequencing of random reads of 100-150 bases (20,000,000 reads per genome)

- 30-60x coverage for quality

- All this mess must be re-organised!

→ ADAM

Page 21

ADAM

Variant Calling

- From an organized set of reads (ADAM Pileup)

- Detect variants (Variant Calling)

→ AVOCADO
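The two steps on this slide can be illustrated with a toy example in plain Scala. It is a deliberately naive sketch, not how ADAM or AVOCADO implement them: the `Read` type, the majority-vote rule and the example data are all assumptions made for illustration.

```scala
object PileupSketch {
  // Hypothetical aligned read: 0-based start position on the reference, plus its bases.
  final case class Read(start: Int, bases: String)

  // Pileup: for every reference position, the bases observed in the reads covering it.
  def pileup(reads: Seq[Read]): Map[Int, Seq[Char]] =
    reads
      .flatMap(read => read.bases.zipWithIndex.map { case (base, offset) => (read.start + offset, base) })
      .groupBy(_._1)
      .map { case (position, observed) => (position, observed.map(_._2)) }

  // Most frequent base at one position (ties are broken arbitrarily).
  def majority(bases: Seq[Char]): Char =
    bases.groupBy(identity).maxBy(_._2.size)._1

  // Naive call: keep the positions whose majority base differs from the reference.
  def callVariants(reference: String, reads: Seq[Read]): Map[Int, Char] =
    pileup(reads)
      .map { case (position, bases) => (position, majority(bases)) }
      .filter { case (position, base) =>
        reference.indices.contains(position) && base != reference(position)
      }

  // Made-up example data: three overlapping reads over a short reference.
  val exampleReference: String = "ACGTACGT"
  val exampleReads: Seq[Read] = Seq(Read(0, "ACGA"), Read(1, "CGAA"), Read(2, "GTAC"))
}
```

In the example data, position 3 is covered by three reads, two of which show `A` where the reference has `T`, so `PileupSketch.callVariants(PileupSketch.exampleReference, PileupSketch.exampleReads)` reports that single position.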

Page 22

ADAM

Genomics specifications

- SAM, BAM, VCF

- Indexable

- libraries

- ~ scalable: hadoop-bam

Page 23

ADAM

ADAM model

- schema based (Avro), libraries are generated

- no storage spec here!

Page 24

ADAM

ADAM model

- Parquet storage

- evenly distributed data

- storage optimized for read/query

- better compression

Page 25

ADAM

ADAM API

- AdamContext provides functions to read from HDFS

Page 26

ADAM

ADAM API

- Scala classes generated from Avro

- Data loaded as RDDs (Spark’s Resilient Distributed Datasets)

- functions on RDDs (write to HDFS, genomic object manipulations)

Page 27

ADAM

ADAM API

- e.g. reading genotypes

Page 28

ADAM

ADAM Benchmark

- It scales!

- Data is more compact

- Read perf is better

- Code is simpler

Page 29

Stratification using 1000Genomes

As usual… let’s get some data.

Genomes relate to health and are private.

Still, there are options!

Page 30

Stratification using 1000Genomes

http://www.1000genomes.org/

(Nowadays targeting 2000 genomes)

ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg

Page 31

Stratification using 1000Genomes

Page 32

Stratification using 1000Genomes

Page 33

Stratification using 1000Genomes

Study genetic variations in populations (needs more contextual data for healthcare).

To assess the value of ADAM, we’ll do some qualitative exploration of the data.

Question: is it possible to predict which subpopulation a given genome belongs to?

Page 34

Stratification using 1000Genomes

We can run an unsupervised algorithm on a massive number of genomes.

The idea is to find clusters that would match subpopulations.

This matters because stratification reflects population histories: gene flows, selection, ...

Page 35

Stratification using 1000Genomes

From the 200 TB of data, we’ll focus on chromosome 6, and only on its variants.

ref: http://en.wikipedia.org/wiki/Chromosome

Page 36

Genome Data

Data structure

Page 37

Genome Data

Data structure

Panel: Map[SampleID, Population]
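As a sketch of how such a `Map[SampleID, Population]` could be built, the snippet below parses panel lines in plain Scala. The tab-separated layout (sample ID first, population code second) and the helper names are assumptions for illustration. The sample IDs and populations are the ones shown in the prediction output later in the talk.

```scala
object PanelSketch {
  type SampleID = String
  type Population = String

  // Parse panel lines, assumed here to be tab-separated with the sample ID in
  // the first column and the population code in the second. Malformed lines are skipped.
  def parsePanel(lines: Seq[String]): Map[SampleID, Population] =
    lines
      .map(_.split("\t"))
      .collect { case fields if fields.length >= 2 => fields(0) -> fields(1) }
      .toMap

  // Keep only the samples that belong to the populations of interest.
  def restrictTo(panel: Map[SampleID, Population], populations: Set[Population]): Map[SampleID, Population] =
    panel.filter { case (_, population) => populations.contains(population) }

  // Example lines built from the samples quoted in the talk, plus one malformed line.
  val exampleLines: Seq[String] = Seq(
    "NA20332\tASW",
    "NA20334\tASW",
    "HG00120\tGBR",
    "NA18560\tCHB",
    "malformed line"
  )
}
```

Looking a sample up with `panel.get(sampleId)` returns an `Option`, which is consistent with the `Some(ASW)` style of the prediction output shown later.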

Page 38

Genome Data

Data structure

Genotypes in VCF format

Basically a text file. Ours were downloaded from S3.

Converted to ADAM Genotypes

Page 39

Machine Learning model

Clustering: KMeans

ref: http://en.wikipedia.org/wiki/K-means_clustering

Page 40

Machine Learning model

Clustering: KMeans

ref: http://en.wikipedia.org/wiki/K-means_clustering

PreProcess = {A,C,T,G}² → {0,1,2}

Space = {0,1,2}¹⁷⁰⁰⁰⁰⁰⁰⁰

Distance = Euclidean (L2) ⁽*⁾

⁽*⁾ MLlib restriction (SPARK-3012), although here L2 ~ L1
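One plausible reading of the preprocessing step, sketched in plain Scala: a genotype is a pair of alleles, encoded as the number of alleles that differ from the reference. That interpretation and the function names are assumptions, since the slide only gives the signature {A,C,T,G}² → {0,1,2}.

```scala
object GenotypeEncodingSketch {
  // Encode a diploid genotype (two alleles) as the number of alleles that
  // differ from the reference allele: 0, 1 or 2.
  def encode(reference: Char, genotype: (Char, Char)): Int =
    Seq(genotype._1, genotype._2).count(_ != reference)

  // Squared Euclidean (L2) distance between two encoded samples.
  def squaredL2(a: Seq[Int], b: Seq[Int]): Int = {
    require(a.length == b.length, "samples must cover the same variants")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // Manhattan (L1) distance between two encoded samples.
  def l1(a: Seq[Int], b: Seq[Int]): Int = {
    require(a.length == b.length, "samples must cover the same variants")
    a.zip(b).map { case (x, y) => math.abs(x - y) }.sum
  }
}
```

On this space every coordinate difference is 0, 1 or 2, so `l1(a, b) <= squaredL2(a, b) <= 2 * l1(a, b)` always holds. That bound is one way to read the "L2 ~ L1" remark.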

Page 41

Machine Learning model

MLlib, KMeans

MLlib:

● Machine Learning Algorithms

● Data structures (e.g. Vector)

Page 42

Machine Learning model

MLlib KMeans

DataFrame Map:

● key = Sample

● value = Vector of Genotype alleles (sorted by Variant)
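The talk relies on MLlib's KMeans. To show what the algorithm does with a key = Sample, value = Vector layout, here is a small stand-in written in plain Scala (Lloyd's iteration with caller-supplied initial centroids). It is not MLlib's API or implementation, and all names in it are made up for the example.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid closest to the point.
  def closest(centroids: Vector[Point], point: Point): Int =
    centroids.indices.minBy(i => squaredDistance(centroids(i), point))

  // Coordinate-wise mean of a non-empty group of points.
  def mean(points: Vector[Point]): Point =
    points.transpose.map(coordinate => coordinate.sum / points.size)

  // One Lloyd step: assign each point to its closest centroid, then move every
  // centroid to the mean of its points (a centroid with no points stays put).
  def step(centroids: Vector[Point], points: Vector[Point]): Vector[Point] = {
    val groups = points.groupBy(p => closest(centroids, p))
    centroids.indices.toVector.map(i => groups.get(i).map(mean).getOrElse(centroids(i)))
  }

  // Repeat the step a fixed number of times, starting from the given centroids.
  def fit(initial: Vector[Point], points: Vector[Point], iterations: Int): Vector[Point] =
    (1 to iterations).foldLeft(initial)((centroids, _) => step(centroids, points))

  // Assign a cluster to each keyed sample (key = sample, value = its vector).
  def predict(centroids: Vector[Point], samples: Map[String, Point]): Map[String, Int] =
    samples.map { case (sample, vector) => (sample, closest(centroids, vector)) }

  // Made-up two-dimensional samples, just to exercise the functions.
  val exampleSamples: Map[String, Point] = Map(
    "a" -> Vector(0.0, 0.0),
    "b" -> Vector(0.0, 1.0),
    "c" -> Vector(2.0, 2.0),
    "d" -> Vector(2.0, 1.0)
  )
  val exampleInitial: Vector[Point] = Vector(Vector(0.0, 0.0), Vector(2.0, 2.0))
}
```

Passing the initial centroids explicitly keeps the sketch deterministic. A real run would typically pick them at random or with a seeding strategy.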

Page 43

Mashup

prediction

Sample [NA20332] is in cluster #0 for population Some(ASW)

Sample [NA20334] is in cluster #2 for population Some(ASW)

Sample [HG00120] is in cluster #2 for population Some(GBR)

Sample [NA18560] is in cluster #1 for population Some(CHB)

Page 44

Mashup

       #0   #1   #2
GBR     0    0   89
ASW    54    0    7
CHB     0   97    0
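A table like this one can be derived by joining the per-sample cluster predictions with the panel. The plain-Scala sketch below shows one way to do it. The function names and the purity measure are additions for illustration, while the counts in `reported` are copied from this slide with zero cells omitted.

```scala
object ConfusionSketch {
  // Count samples per (population, cluster) pair.
  // predictions: sample -> cluster, panel: sample -> population.
  def confusion(predictions: Map[String, Int], panel: Map[String, String]): Map[(String, Int), Int] =
    predictions.toSeq
      .flatMap { case (sample, cluster) => panel.get(sample).map(population => (population, cluster)) }
      .groupBy(identity)
      .map { case (cell, hits) => (cell, hits.size) }

  // Share of samples that sit in the most common cluster of their population.
  def purity(matrix: Map[(String, Int), Int]): Double = {
    val byPopulation = matrix.groupBy { case ((population, _), _) => population }
    val dominant = byPopulation.values.map(_.values.max).sum
    dominant.toDouble / matrix.values.sum
  }

  // Counts copied from the slide (zero cells omitted).
  val reported: Map[(String, Int), Int] = Map(
    ("GBR", 2) -> 89,
    ("ASW", 0) -> 54,
    ("ASW", 2) -> 7,
    ("CHB", 1) -> 97
  )
}
```

For the reported counts, 240 of the 247 samples fall in the dominant cluster of their population, so `ConfusionSketch.purity(ConfusionSketch.reported)` is about 0.97. The 7 remaining samples are ASW genomes that share cluster #2 with GBR.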

Page 45

Cluster

4 m3.xlarge instances (EC2)

16 cores + 60G

Page 46

Cluster

Performance

Page 47

Cluster

40 m3.xlarge

160 cores + 600G

Page 48

Conclusions and future work

● ADAM and Spark provide tools to manipulate genomics data in a scalable way

● Simple APIs in Scala

● MLlib for machine learning

→ implement less naïve algorithms

→ cross medical and environmental data with genomes

Page 49

Acknowledgements

Scala.IO

AmpLab

Matt Massie

Frank Nothaft

Vincent Botta

Page 50

That’s all Folks

Apparently, we’re supposed to stay on stage
Waiting for questions
Hoping for none
Looking at the bar
And the lunch
Oh there are beers
And candies

who can read this?