Lightning fast genomics with Spark, Adam and Scala

Post on 14-Jun-2015


Description

We are at a time when biotechnology allows us to obtain a personal genome for $1,000. Tremendous progress has been made in DNA sequencing since the 70s: more samples per experiment and higher genomic coverage, at ever higher speeds. However, the genomic analysis standards developed over the years weren't designed with scalability and adaptability in mind. In this talk, we'll present a game-changing technology in this area: ADAM, initiated by the AMPLab at UC Berkeley. ADAM is a framework based on Apache Spark and the Parquet storage format. We'll see how it can speed up a sequence reconstruction by a factor of 150.

Transcript of Lightning fast genomics with Spark, Adam and Scala

Lightning fast genomics with Spark and ADAM

Andy

@Noootsab, @NextLab_be
Wajug co-driver, Devoxx4Kids organizer
Maths & CS
Data lover: geo, open, massive
Fool

Who are we?

Xavier

@xtordoir, SilicoCloud

-> Physics -> Data analysis -> genomics -> scalable systems -> ...

Genomics

What is genomics about?

Medical Diagnostics

Drug response

Diseases mechanisms

Genomics

What is genomics about?

- A human genome is a 3-billion-long sequence of nucleic acids (“bases”)
- About 1 base in 1,000 varies across the human population
- Genomes encode bio-molecules (tens of thousands of them)
- These molecules interact together… and with the environment

→ Biological systems are very complex

Genomics

State of the art

- growing technological capacity
- cost reduction
- growing data

Genomics

State of the art

- I.T. becomes the bottleneck (cost and latency)
- sacrifice data with sampling or cut-offs (Andrea Sboner et al.)

Genomics

Blocking points

- “legacy stack” not designed for scalability (C, Perl, …)
- HPC approach not a good fit (data-intensive workloads)

Genomics

Future of genomics

- Personal genomes (e.g. 1,000,000 genomes for cancer research)
- New sequencing technologies
- Sequence “stuff” as needed (e.g. microbiome, diagnostics)
- medicalCondition = f(genomics, environmentHistory)

Genomics

Need for scalability → Scala & Spark

Need for simplicity and clarity → ADAM

Parquet 101

Columnar storage: row-oriented vs. column-oriented layouts

Parquet 101

Columnar storage

> Homogeneous collocated data
> Better range access
> Better encoding
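The row versus column contrast can be made concrete with a toy example in plain Scala (the `Read` record and its fields are illustrative, not Parquet's actual machinery):

```scala
// Toy illustration of the two layouts (not Parquet itself; Read is a made-up record).
case class Read(id: Int, chrom: String, pos: Long)

val reads = Seq(Read(1, "chr6", 100L), Read(2, "chr6", 250L), Read(3, "chr6", 900L))

// Row-oriented: one tuple per record, values of mixed types interleaved
val rowOriented: Seq[(Int, String, Long)] = reads.map(r => (r.id, r.chrom, r.pos))

// Column-oriented: one homogeneous sequence per field
val ids    = reads.map(_.id)
val chroms = reads.map(_.chrom)  // all "chr6": trivially run-length encodable
val poss   = reads.map(_.pos)    // a range scan touches only this column
val inRange = poss.filter(p => p >= 200L && p <= 1000L)
```

Collocating homogeneous values is what enables the better encodings (run-length, dictionary) and cheap range access claimed above.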

Parquet 101

Efficient encoding of nested typed structures

message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;
    }
    optional string Url;
  }
}

Parquet 101

Efficient encoding of nested typed structures


Nested structure → Tree

Empty levels → Branch pruning

Repetitions → Metadata (index)

Types → Safe/Fast codec

Parquet 101

Efficient encoding of nested typed structures

ref: https://blog.twitter.com/2013/dremel-made-simple-with-parquet

Parquet 101

Optimized distributed storage (e.g. in HDFS)

ref: http://grepalex.com/2014/05/13/parquet-file-format-and-object-model/

Parquet 101

Efficient (schema based) serialization: AVRO

{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}

The same schema in Avro IDL:

record User {
  string name;
  union { null, int } favorite_number = null;
  union { null, string } favorite_color = null;
}

Parquet 101

Efficient (schema based) serialization: AVRO


The JSON schema is part of the:
● protocol
● serialization

→ less metadata

Define: IDL → JSON
Send: Binary → JSON

ADAM

Credits: AmpLab (UC Berkeley)

ADAM

Overview (Sequencing)

- DNA is a molecule

…or a Seq[Char] over the (A, T, G, C) alphabet
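That Seq[Char] view is literal in Scala; a minimal sketch (the complement helper is purely illustrative, not part of ADAM):

```scala
// A DNA read as a Seq[Char] over the {A, T, G, C} alphabet.
val alphabet: Set[Char] = Set('A', 'T', 'G', 'C')

// Watson-Crick base pairing: A<->T, G<->C (illustrative helper)
def complement(read: Seq[Char]): Seq[Char] = read.map {
  case 'A' => 'T'
  case 'T' => 'A'
  case 'G' => 'C'
  case 'C' => 'G'
}

val read: Seq[Char] = "GATTACA".toSeq
require(read.forall(alphabet.contains))
val comp = complement(read).mkString  // "CTAATGT"
```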

ADAM

Sequencing

- Massively parallel sequencing of random 100-150-base reads (20,000,000 reads per genome)

- 30-60x coverage for quality

- All this mess must be re-organised!

→ ADAM

ADAM

Variant Calling

- From an organized set of reads (ADAM Pileup)

- Detect variants (Variant Calling)

→ AVOCADO

ADAM

Genomics specifications

- SAM, BAM, VCF

- Indexable

- libraries

- ~ scalable: hadoop-bam

ADAM

ADAM model

- schema based (Avro), libraries are generated

- no storage spec here!

ADAM

ADAM model

- Parquet storage

- evenly distributed data

- storage optimized for read/query

- better compression

ADAM

ADAM API

- AdamContext provides functions to read from HDFS

ADAM

ADAM API

- Scala classes generated from Avro

- Data loaded as RDDs (Spark’s Resilient Distributed Datasets)

- functions on RDDs (write to HDFS, genomic object manipulations)

ADAM

ADAM API

- e.g. reading genotypes
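The "reading genotypes" example looked roughly like the sketch below. Method names varied across ADAM releases and the path is made up, so treat the calls as indicative rather than authoritative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.bdgenomics.adam.rdd.ADAMContext._  // adds ADAM load methods to SparkContext
import org.bdgenomics.formats.avro.Genotype

// Sketch only: API names changed across ADAM versions, path is illustrative.
val sc = new SparkContext(new SparkConf().setAppName("genotypes-demo"))

// Genotypes stored as ADAM/Parquet come back as an RDD of Avro-generated records
val genotypes = sc.loadGenotypes("hdfs:///data/chr6.adam")

// From there, ordinary RDD operations apply, e.g. grouping calls by sample
val bySample = genotypes.map(g => (g.getSampleId, g)).groupByKey()
```

Because the records are generated from the Avro schema, the rest of the pipeline is plain Scala over RDDs, with no format-parsing code.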

ADAM

ADAM Benchmark

- It scales!
- Data is more compact
- Read perf is better
- Code is simpler

As usual… let’s get some data.

Genomes relate to health and are private.

Still, there are options!

Stratification using 1000Genomes


http://www.1000genomes.org/ (nowadays targeting 2000 genomes)

ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg


Stratification using 1000Genomes

Study genetic variations in populations (healthcare applications would need more contextual data).

To validate the interest of ADAM, we’ll do some qualitative exploration of the data.

Question: is it possible to predict which subpopulation a given genome belongs to?

We can run an unsupervised algorithm on a massive number of genomes. The idea is to find clusters that match subpopulations.

Stratification using 1000Genomes

Actually, it’s important because it reflects population histories: gene flows, selection, ...

Stratification using 1000Genomes

From the 200 TB of data, we’ll focus on chromosome 6, and in fact only on its variants

ref: http://en.wikipedia.org/wiki/Chromosome

Genome Data

Data structure

Genome Data

Data structure

Panel: Map[SampleID, Population]
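Such a panel map can be built from the 1000 Genomes panel file, which is tab-separated text; the exact column layout assumed below is illustrative:

```scala
// Build Map[SampleID, Population] from panel lines assumed to look like
// "sampleID<TAB>population<TAB>superPopulation..." (layout is an assumption).
def parsePanel(lines: Seq[String]): Map[String, String] =
  lines.map { line =>
    val cols = line.split("\t")
    cols(0) -> cols(1)  // SampleID -> Population
  }.toMap

val panel = parsePanel(Seq(
  "HG00120\tGBR\tEUR",
  "NA18560\tCHB\tEAS",
  "NA20332\tASW\tAFR"
))
// panel("HG00120") -> "GBR"
```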

Genome Data

Data structure: Genotypes in VCF format

Basically a text file. Ours were downloaded from S3, then converted to ADAM Genotypes.

Machine Learning model

Clustering: KMeans

ref: http://en.wikipedia.org/wiki/K-means_clustering


PreProcess = {A,C,T,G}² → {0,1,2}

Space = {0,1,2}¹⁷⁰⁰⁰⁰⁰⁰⁰

Distance = Euclidean (L2) ⁽*⁾

⁽*⁾ MLlib restriction, although here L2 ~ L1 (SPARK-3012)
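The {A,C,T,G}² → {0,1,2} pre-processing simply counts alternate alleles per diploid genotype, so each genome becomes a vector over {0,1,2}; a minimal sketch (names are illustrative):

```scala
// One diploid genotype at one variant -> number of alternate alleles (0, 1 or 2).
// 'ref' is the reference allele at that variant; names are illustrative.
def altCount(alleles: (Char, Char), ref: Char): Int =
  Seq(alleles._1, alleles._2).count(_ != ref)

val homRef = altCount(('A', 'A'), ref = 'A')  // 0: homozygous reference
val het    = altCount(('A', 'T'), ref = 'A')  // 1: heterozygous
val homAlt = altCount(('T', 'T'), ref = 'A')  // 2: homozygous alternate
```

Applying this at every variant position, sorted consistently across samples, yields the points of the {0,1,2}¹⁷⁰⁰⁰⁰⁰⁰⁰ space above.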

Machine Learning model

MLlib, KMeans

MLlib provides:
● Machine Learning algorithms
● Data structures (e.g. Vector)

Machine Learning model

MLLib KMeans

DataFrame Map:
● key = Sample
● value = Vector of Genotype alleles (sorted by Variant)
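MLlib's KMeans consumes those per-sample vectors; the assignment step it repeats at scale can be sketched in plain Scala:

```scala
// The core KMeans assignment step: each sample's genotype vector goes to the
// centroid at minimal squared Euclidean (L2) distance. Plain Scala sketch of
// what MLlib performs in parallel over distributed data.
def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def assign(sample: Array[Double], centroids: Seq[Array[Double]]): Int =
  centroids.indices.minBy(i => sqDist(sample, centroids(i)))

// Toy 3-variant example with two centroids
val centroids = Seq(Array(0.0, 0.0, 0.0), Array(2.0, 2.0, 2.0))
val cluster = assign(Array(0.0, 1.0, 0.0), centroids)  // 0
```

KMeans alternates this assignment with recomputing each centroid as the mean of its assigned vectors until convergence.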

Mashup

prediction

Sample [NA20332] is in cluster #0 for population Some(ASW)

Sample [NA20334] is in cluster #2 for population Some(ASW)

Sample [HG00120] is in cluster #2 for population Some(GBR)

Sample [NA18560] is in cluster #1 for population Some(CHB)

Mashup

        #0   #1   #2
GBR      0    0   89
ASW     54    0    7
CHB      0   97    0

Cluster

4 m3.xlarge instances (EC2): 16 cores + 60 GB RAM

Cluster

Performances

Cluster

40 m3.xlarge instances: 160 cores + 600 GB RAM

Conclusions and future work

● ADAM and Spark provide tools to manipulate genomics data in a scalable way

● Simple APIs in Scala
● MLlib for machine learning

→ implement less naïve algorithms
→ cross medical and environmental data with genomes

Acknowledgements

Scala.IO

AmpLab: Matt Massie, Frank Nothaft

Vincent Botta


That’s all Folks

Apparently, we’re supposed to stay on stage
Waiting for questions
Hoping for none
Looking at the bar
And the lunch
Oh there are beers
And candies

Who can read this?