Lightning fast genomics with Spark, Adam and Scala

  • Date posted: 14-Jun-2015
  • Category: Technology


description

We are at a time when biotechnology lets us obtain a personal genome for $1,000. Tremendous progress has been made in DNA sequencing since the 70s: more samples per experiment, and higher genomic coverage at higher speed. But the genomic analysis standards developed over the years weren't designed with scalability and adaptability in mind. In this talk, we'll present a game-changing technology in this area: ADAM, initiated by the AMPLab at UC Berkeley. ADAM is a framework based on Apache Spark and Parquet storage. We'll see how it can speed up a sequence reconstruction by a factor of 150.

Transcript of Lightning fast genomics with Spark, Adam and Scala

  • 1. Lightning fast genomics: With Spark and ADAM

2. Who are we?
Andy: @Noootsab, @NextLab_be, @Wajug co-driver, @Devoxx4Kids organizer. Maths & CS. Data lover: geo, open, massive. Fool.
Xavier: @xtordoir, SilicoCloud. Physics -> data analysis -> genomics -> scalable systems -> ...

3. Genomics. What is genomics about? Medical diagnostics, drug response, disease mechanisms.

4. Genomics. What is genomics about?
- A human genome is a 3-billion-long sequence of nucleic acids (bases)
- About 1 base per 1,000 is variable in the human population
- Genomes encode tens of thousands of bio-molecules
- These molecules interact together... and with the environment
Biological systems are very complex.

5. Genomics. State of the art: growing technological capacity, cost reduction, growing data.

6. Genomics. State of the art: I.T. becomes the bottleneck (cost and latency), so data gets sacrificed with sampling or cut-offs. (Andrea Sboner et al.)

7. Genomics. Blocking points: the legacy stack was not designed to be scalable (C, Perl, ...), and the HPC approach is not a fit (the workload is data-intensive).

8. Genomics. Future of genomics:
- Personal genomes (e.g. 1,000,000 genomes for cancer research)
- New sequencing technologies
- Sequencing as needed (e.g. microbiome, diagnostics)
- medicalCondition = f(genomics, environmentHistory)

9. Genomics. Needs of scalability -> Scala & Spark. Needs of simplicity and clarity -> ADAM.

10. Parquet 101. Columnar storage: row-oriented vs. column-oriented.

11. Parquet 101. Columnar storage gives homogeneous collocated data, better range access, and better encoding.

12. Parquet 101. Efficient encoding of nested typed structures:

message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;
    }
    optional string Url;
  }
}
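To make the columnar-storage slides (10-11) concrete, here is a minimal sketch in plain Scala. It is not Parquet's actual on-disk format; the `Read` type and its fields are hypothetical. It only illustrates why a column-wise layout groups homogeneous values together, which is what enables better encoding and cheap range scans over a single field.

```scala
// Illustrative only: the same records laid out row-wise vs. column-wise.
case class Read(sample: String, chromosome: Int, position: Long)

val reads = Seq(
  Read("NA20332", 6, 100L),
  Read("NA20334", 6, 250L),
  Read("HG00120", 6, 300L)
)

// Row-oriented: one heterogeneous tuple per record.
val rowOriented: Seq[(String, Int, Long)] =
  reads.map(r => (r.sample, r.chromosome, r.position))

// Column-oriented: one homogeneous sequence per field. A constant column
// (chromosome) is trivially compressible with run-length encoding.
val columnOriented: (Seq[String], Seq[Int], Seq[Long]) =
  (reads.map(_.sample), reads.map(_.chromosome), reads.map(_.position))

// A range query over positions now touches a single column.
val inRange = columnOriented._3.count(p => p >= 200L && p <= 400L)
```

The row layout forces a scan to deserialize every field of every record; the column layout reads only the `position` column.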
13. Parquet 101. Efficient encoding of nested typed structures (same Document schema): nested structure -> tree, empty levels -> branch pruning, repetitions -> metadata (index), types -> safe/fast codec.

14. Parquet 101. Efficient encoding of nested typed structures. ref: https://blog.twitter.com/2013/dremel-made-simple-with-parquet

15. Parquet 101. Optimized distributed storage (for instance in HDFS). ref: http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/

16. Parquet 101. Efficient (schema-based) serialization: Avro. The same User record as a JSON schema and as IDL:

JSON schema:
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}

IDL:
record User {
  string name;
  union { null, int } favorite_number = null;
  union { null, string } favorite_color = null;
}

17. Parquet 101. Efficient (schema-based) serialization: Avro. The schema is part of the protocol, so the serialization carries less metadata. Define with IDL or JSON; send as binary or JSON.

18. ADAM. Credits: AMPLab (UC Berkeley).

19. ADAM. Overview (sequencing): DNA is a molecule, or a Seq[Char] over the (A, T, G, C) alphabet.

20. ADAM. Sequencing:
- Massively parallel sequencing of random 100-150-base reads (20,000,000 reads per genome)
- 30-60x coverage for quality
- All this mess must be re-organised! -> ADAM

21. ADAM. Variant calling: from an organized set of reads (an ADAM pileup), detect variants (variant calling). -> AVOCADO

22. ADAM. Genomics specifications: SAM, BAM, VCF. Indexable, with libraries; ~scalable: hadoop-bam.
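To make the sequencing slides (19-21) concrete, here is a minimal plain-Scala sketch of the pile-up idea: the coverage at a genome position is the number of short reads whose interval overlaps it. The `ShortRead` type and the read data are hypothetical, not ADAM's data model.

```scala
// Illustrative only: a short read covers positions [start, start + length).
case class ShortRead(start: Long, bases: String) {
  def covers(pos: Long): Boolean = pos >= start && pos < start + bases.length
}

// Three 100-base reads, the first two overlapping (toy data).
val shortReads = Seq(
  ShortRead(0L,   "ATGC" * 25),
  ShortRead(50L,  "GGCA" * 25),
  ShortRead(120L, "TTAG" * 25)
)

// Coverage at a position = how many reads pile up over it.
def coverage(pos: Long): Int = shortReads.count(_.covers(pos))
```

At 30-60x coverage, dozens of such reads overlap each position; organizing them into this position-indexed view is the re-organization step the slide delegates to ADAM.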
23. ADAM. The ADAM model: schema-based (Avro), libraries are generated, and no storage spec here!

24. ADAM. The ADAM model: Parquet storage, evenly distributed data, storage optimized for read/query, better compression.

25. ADAM. The ADAM API: AdamContext provides functions to read from HDFS.

26. ADAM. The ADAM API: Scala classes generated from Avro; data loaded as RDDs (Spark's Resilient Distributed Datasets); functions on RDDs (write to HDFS, genomic object manipulations).

27. ADAM. The ADAM API: e.g. reading genotypes.

28. ADAM. ADAM benchmark: it scales! Data is more compact, read performance is better, and code is simpler.

29. Stratification using 1000 Genomes. As usual, let's get some data. Genomes relate to health and are private. Still, there are options!

30. Stratification using 1000 Genomes. http://www.1000genomes.org/ (nowadays targeting 2,000 genomes). ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg

31. Stratification using 1000 Genomes.

32. Stratification using 1000 Genomes.

33. Stratification using 1000 Genomes. Study genetic variations in populations (healthcare would need more contextual data). To validate the interest in ADAM, we'll do some qualitative exploration of the data. Question: is it possible to predict the membership of a given genome in a subpopulation?

34. Stratification using 1000 Genomes. We can run an unsupervised algorithm on a massive number of genomes. The idea is to find clusters that match subpopulations. This matters because population structure reflects population histories: gene flows, selection, ...

35. Stratification using 1000 Genomes. From the 200 TB of data, we'll focus on the 6th chromosome, and actually only on its variants. ref: http://en.wikipedia.org/wiki/Chromosome

36. Genome data. Data structure.

37. Genome data. Data structure. Panel: Map[SampleID, Population]

38. Genome data. Data structure. Genotypes come in VCF format, basically a text file. Ours were downloaded from S3 and converted to ADAM genotypes.

39. Machine learning model. Clustering: KMeans. ref: http://en.wikipedia.org/wiki/K-means_clustering
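The panel structure from slide 37 can be sketched in plain Scala. The tab-separated column layout and the sample lines below are assumptions for illustration, not the actual 1000 Genomes panel file parser.

```scala
// Slide 37's data structure: a panel mapping sample IDs to populations.
type SampleID = String
type Population = String

// Hypothetical panel lines: sample ID and population code per line.
val panelLines = Seq(
  "NA20332\tASW",
  "NA20334\tASW",
  "HG00120\tGBR",
  "NA18560\tCHB"
)

val panel: Map[SampleID, Population] =
  panelLines.map { line =>
    val fields = line.split("\t")
    fields(0) -> fields(1)
  }.toMap

// Lookups return an Option, matching the Some(GBR)-style output of slide 43.
val pop = panel.get("HG00120")
```

This is the table the mashup step later joins against cluster assignments to check whether clusters line up with subpopulations.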
40. Machine learning model. Clustering: KMeans.
PreProcess = {A, C, T, G} -> {0, 1, 2}
Space = {0, 1, 2}
Distance = Euclidean (L2); an MLlib restriction, although here L2 ~ L1 (SPARK-3012)
ref: http://en.wikipedia.org/wiki/K-means_clustering

41. Machine learning model. MLlib KMeans. MLlib provides machine learning algorithms and data structures (e.g. Vector).

42. Machine learning model. MLlib KMeans. Data kept as a Map: key = sample, value = vector of genotype alleles (sorted by variant).

43. Mashup. Prediction:
Sample [NA20332] is in cluster #0 for population Some(ASW)
Sample [NA20334] is in cluster #2 for population Some(ASW)
Sample [HG00120] is in cluster #2 for population Some(GBR)
Sample [NA18560] is in cluster #1 for population Some(CHB)

44. Mashup. Cluster vs. population counts:
      #0  #1  #2
GBR    0   0  89
ASW   54   0   7
CHB    0  97   0

45. Cluster. 4 m3.xlarge instances (EC2): 16 cores + 60 GB RAM.

46. Cluster. Performances.

47. Cluster. 40 m3.xlarge instances: 160 cores + 600 GB RAM.

48. Conclusions and future work. ADAM and Spark provide tools to manipulate genomics data in a scalable way, with simple APIs in Scala and MLlib for machine learning. Future work: implement less naive algorithms, and cross medical and environmental data with genomes.

49. Acknowledgements. Scala.IO, AMPLab, Matt Massie, Frank Nothaft, Vincent Botta.

50. That's all, folks! Apparently, we're supposed to stay on stage... waiting for questions, hoping for none, looking at the bar... and the lunch. Oh, there are beers. And candies. Who can read this?
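The pre-processing from slide 40 ({A, C, T, G} -> {0, 1, 2}) can be sketched in plain Scala. This assumes each biallelic genotype is encoded as its alternate-allele count (0, 1 or 2), which turns every sample into the numeric vector slide 42 describes; the types and toy data are hypothetical, not ADAM's or MLlib's API.

```scala
// Illustrative only: count how many of the two alleles differ from the
// reference base, yielding a value in {0, 1, 2}.
def altAlleleCount(genotype: (Char, Char), ref: Char): Int =
  Seq(genotype._1, genotype._2).count(_ != ref)

// Two toy variants with reference alleles A and G.
val refs = Seq('A', 'G')

// Hypothetical genotypes per sample, sorted by variant as on slide 42.
val samples: Map[String, Seq[(Char, Char)]] = Map(
  "NA20332" -> Seq(('A', 'A'), ('G', 'T')),
  "HG00120" -> Seq(('A', 'C'), ('T', 'T'))
)

// key = sample, value = vector of allele counts: the KMeans feature space.
val vectors: Map[String, Seq[Int]] =
  samples.map { case (id, gts) =>
    id -> gts.zip(refs).map { case (gt, ref) => altAlleleCount(gt, ref) }
  }
```

On such {0, 1, 2} vectors the L2 distance MLlib imposes behaves much like L1, which is the point slide 40 makes with "L2 ~ L1" and SPARK-3012.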