Lightning fast genomics with Spark, Adam and Scala


We are at a time when biotech allows us to get personal genomes for $1000. Tremendous progress has been made in DNA sequencing since the 70s, e.g. more samples per experiment and more genomic coverage at higher speeds. The genomic analysis standards developed over the years weren't designed with scalability and adaptability in mind. In this talk, we'll present a game-changing technology in this area, ADAM, initiated by the AMPLab at Berkeley. ADAM is a framework based on Apache Spark and the Parquet storage format. We'll see how it can speed up a sequence reconstruction by a factor of 150.

Transcript of Lightning fast genomics with Spark, Adam and Scala

  • 1. Lightning fast genomics with Spark and ADAM

  • 2. Who are we?
      Andy: @Noootsab, @NextLab_be, @Wajug co-driver, @Devoxx4Kids organizer, Maths & CS, Data lover: geo, open, massive, Fool
      Xavier: @xtordoir, SilicoCloud, -> Physics -> Data analysis -> genomics -> scalable systems -> ...

  • 3. Genomics: What is genomics about?
      - Medical Diagnostics
      - Drug response
      - Diseases mechanisms

  • 4. Genomics: What is genomics about?
      - A human genome is a 3 billion long sequence (of nucleic acids: bases)
      - 1 per 1000 bases is variable in the human population
      - Genomes encode bio-molecules (tens of thousands)
      - These molecules interact together... and with the environment
      -> Biological systems are very complex

  • 5. Genomics: State of the art
      - growing technological capacity
      - cost reduction
      - growing data

  • 6. Genomics: State of the art
      - I.T. becomes the bottleneck (cost and latency)
      - sacrifice data with sampling or cut-offs
      Andrea Sboner et al.

  • 7. Genomics: Blocking points
      - legacy stack not designed to be scalable (C, Perl, ...)
      - HPC approach not a fit (data intensive)

  • 8. Genomics: Future of genomics
      - Personal genomes (e.g. 1,000,000 genomes for cancer research)
      - New sequencing technologies
      - Sequence stuff as needed (e.g. microbiome, diagnostics)
      - medicalCondition = f(genomics, environmentHistory)

  • 9. Genomics
      - Needs of scalability -> Scala & Spark
      - Needs of simplicity, clarity -> ADAM

  • 10. Parquet 101: Columnar storage
      - Row oriented
      - Column oriented

  • 11. Parquet 101: Columnar storage
      > Homogeneous collocated data
      > Better range access
      > Better encoding

  • 12. Parquet 101: Efficient encoding of nested typed structures
      message Document {
        required int64 DocId;
        optional group Links {
          repeated int64 Backward;
          repeated int64 Forward;
        }
        repeated group Name {
          repeated group Language {
            required string Code;
            optional string Country;
          }
          optional string Url;
        }
      }

  • 13. Parquet 101: Efficient encoding of nested typed structures (same Document schema as slide 12, annotated)
      - Nested structure -> Tree
      - Empty levels -> Branch pruning
      - Repetitions -> Metadata (index)
      - Types -> Safe/Fast codec

  • 14. Parquet 101: Efficient encoding of nested typed structures
      ref:

  • 15. Parquet 101: Optimized distributed storage (e.g. in HDFS)
      ref:

  • 16. Parquet 101: Efficient (schema-based) serialization: Avro
      JSON Schema:
      {
        "namespace": "example.avro",
        "type": "record",
        "name": "User",
        "fields": [
          {"name": "name", "type": "string"},
          {"name": "favorite_number", "type": ["int", "null"]},
          {"name": "favorite_color", "type": ["string", "null"]}
        ]
      }
      IDL:
      record User {
        string name;
        union { null, int } favorite_number = null;
        union { null, string } favorite_color = null;
      }

  • 17. Parquet 101: Efficient (schema-based) serialization: Avro (same User JSON schema as slide 16)
      JSON Schema, part of the:
      - protocol
      - serialization
      - less metadata
      Define: IDL, JSON
      Send: Binary, JSON

  • 18. ADAM
      Credits: AMPLab (UC Berkeley)

  • 19. ADAM: Overview (Sequencing)
      - DNA is a molecule
      - or a Seq[Char]
      - (A, T, G, C) alphabet

  • 20. ADAM: Sequencing
      - Massively parallel sequencing of random 100-150 base reads (20,000,000 reads per genome)
      - 30-60x coverage for quality
      - All this mess must be re-organised! -> ADAM

  • 21. ADAM: Variants Calling
      - From an organized set of reads (ADAM Pileup)
      - Detect variants (Variant Calling) -> AVOCADO

  • 22. ADAM: Genomics specifications
      - SAM, BAM, VCF
      - Indexable
      - libraries
      - ~ scalable: hadoop-bam

  • 23. ADAM: ADAM model
      - schema based (Avro), libraries are generated
      - no storage spec here!

  • 24. ADAM: ADAM model
      - Parquet storage
      - evenly distribute data
      - storage optimized for read/query
      - better compression

  • 25. ADAM: ADAM API
      - AdamContext provides functions to read from HDFS

  • 26. ADAM: ADAM API
      - Scala classes generated from Avro
      - Data loaded as RDDs (Spark's Resilient Distributed Datasets)
      - functions on RDDs (write to HDFS, genomic objects manipulations)

  • 27. ADAM: ADAM API
      - e.g. reading genotypes

  • 28. ADAM: ADAM Benchmark
      - It scales!
      - Data is more compact
      - Read perf is better
      - Code is simpler

  • 29. Stratification using 1000 Genomes
      As usual... let's get some data.
      Genomes relate to health and are private.
      Still, there are options!

  • 30. Stratification using 1000 Genomes
      (targeting 2000 genomes)
      ref:

  • 31. Stratification using 1000 Genomes

  • 32. Stratification using 1000 Genomes

  • 33. Stratification using 1000 Genomes
      Study genetic variations in populations (needs more contextual data for healthcare).
      To validate the interest in ADAM, we'll do some qualitative exploration of the data.
      Question: is it possible to predict the membership of a given genome in a subpopulation?

  • 34. Stratification using 1000 Genomes
      We can run an unsupervised algorithm on a massive number of genomes.
      The idea is to find clusters that would match subpopulations.
      Actually, it's important because it reflects populations' histories: gene flows, selection, ...

  • 35. Stratification using 1000 Genomes
      From the 200 TB of data, we'll focus on the 6th chromosome, actually only its variants.
      ref:

  • 36. Genome Data: Data structure

  • 37. Genome Data: Data structure
      Panel: Map[SampleID, Population]

  • 38. Genome Data: Data structure
      Genotypes in VCF format.
      Basically a text file. Ours were downloaded from S3.
      Converted to ADAM Genotypes.

  • 39. Machine Learning model: Clustering: KMeans
      ref:

  • 40. Machine Learning model: Clustering: KMeans
      PreProcess = {A,C,T,G} -> {0,1,2}
      Space = {0,1,2}
      Distance = Euclidean (L2) *
      * MLlib restriction (SPARK-3012), although here: L2 ~ L1
      ref:

  • 41. Machine Learning model: MLlib, KMeans
      MLlib:
      - Machine Learning Algorithms
      - Data structures (e.g. Vector)

  • 42. Machine Learning model: MLlib KMeans
      DataFrame
      Map: key = Sample, value = Vector of Genotypes alleles (sorted by Variant)

  • 43. Mashup: prediction
      Sample [NA20332] is in cluster #0 for population Some(ASW)
      Sample [NA20334] is in cluster #2 for population Some(ASW)
      Sample [HG00120] is in cluster #2 for population Some(GBR)
      Sample [NA18560] is in cluster #1 for population Some(CHB)

  • 44. Mashup
             #0    #1    #2
      GBR     0     0    89
      ASW    54     0     7
      CHB     0    97     0

  • 45. Cluster
      4 m3.xlarge instances (EC2)
      16 cores + 60G

  • 46. Cluster: Performances

  • 47. Cluster
      40 m3.xlarge
      160 cores + 600G

  • 48. Conclusions and future work
      - ADAM and Spark provide tools to manipulate genomics data in a scalable way
      - Simple APIs in Scala
      - MLlib for machine learning
      - implement less naive algorithms
      - cross medical and environmental data with genomes

  • 49. Acknowledgements
      Scala.IO, AMPLab, Matt Massie, Frank Nothaft, Vincent Botta

  • 50. That's all Folks
      Apparently, we're supposed to stay on stage
      Waiting for questions
      Hoping for none
      Looking at the bar
      And the lunch
      Oh there are beers
      And candies
      who can read this?
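
The clustering slides (40 to 44) describe three steps: encoding genotypes into {0,1,2}, comparing samples with a Euclidean distance, and crossing predicted clusters with the population panel. The following is a minimal sketch of those steps in plain Scala collections, not the talk's Spark/ADAM/MLlib code. The names `encode`, `euclidean` and `crossTab` are hypothetical, and the encoding rule (count how many of the two alleles differ from the reference base) is an assumption about how {A,C,T,G} is mapped to {0,1,2}. The four sample assignments are the ones listed on the prediction slide.

```scala
object StratificationSketch {

  // ASSUMPTION: a genotype is encoded as the number of its two alleles
  // that differ from the reference base, which yields a value in {0, 1, 2}.
  def encode(reference: Char, alleles: (Char, Char)): Int =
    Seq(alleles._1, alleles._2).count(_ != reference)

  // Euclidean (L2) distance between two genotype vectors that cover
  // the same variants in the same order.
  def euclidean(a: Seq[Int], b: Seq[Int]): Double = {
    require(a.length == b.length, "vectors must cover the same variants")
    math.sqrt(a.zip(b).map { case (x, y) =>
      val d = (x - y).toDouble
      d * d
    }.sum)
  }

  // Cross-tabulate (population, cluster) pairs into
  // population -> (cluster -> number of samples).
  def crossTab(assignments: Seq[(String, Int)]): Map[String, Map[Int, Int]] =
    assignments.groupBy(_._1).map { case (population, rows) =>
      population -> rows.groupBy(_._2).map { case (cluster, rs) => cluster -> rs.size }
    }

  def main(args: Array[String]): Unit = {
    // Panel and predicted clusters taken from the prediction slide.
    val panel = Map(
      "NA20332" -> "ASW", "NA20334" -> "ASW",
      "HG00120" -> "GBR", "NA18560" -> "CHB")
    val predicted = Map(
      "NA20332" -> 0, "NA20334" -> 2,
      "HG00120" -> 2, "NA18560" -> 1)

    // Join each sample's population with its predicted cluster.
    val pairs = predicted.toSeq.map { case (sample, cluster) => (panel(sample), cluster) }
    crossTab(pairs).foreach { case (population, counts) =>
      println(s"$population: $counts")
    }

    // Two toy genotype vectors over three hypothetical variants.
    val s1 = Seq(encode('A', ('A', 'A')), encode('C', ('C', 'T')), encode('G', ('T', 'T')))
    val s2 = Seq(encode('A', ('A', 'G')), encode('C', ('C', 'C')), encode('G', ('T', 'T')))
    println(s"distance = ${euclidean(s1, s2)}")
  }
}
```

On a real data set, the per-sample vectors would be built from all variants of the chromosome and handed to a clustering algorithm such as k-means; the cross-tabulation step is what produces a table like the one on slide 44.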