Using Apache Spark to Fight World Hunger - Spark Meetup


Spark Meetup, December 2015
Noam


Food shortage: new problems, new solutions
Intermezzo: how DNA works
Tachles (bottom line): what we do with Apache Spark

The planet has gotten very populous

And it's the only one we've got

World Population
Annual growth rate:
Peak: 2.1% (1962)
Current: 1.1% (2009)

Food intake
source:

Upscale: Same area, more crops

Plant breeding
An ancient art
Incremental changes
Slow but considerable


How long does it take today?

Maize: 10-15 years
source:

How breeding works
[animated diagram of successive breeding generations]

Computational genomics
Prices of DNA sequencing
Number of samples per crop sequenced and analyzed
Amount and quality of genomic data
Prices of computation
Prices of storage
We're entering a new era: BIG DATA Genomics

Food security - a computational problem?
The plant's potential lies in its DNA.
We analyze and compare sequences from many plants.
Resulting in better predictions for breeding.
Faster rate of crop improvement.

Intermezzo: DNA - how does it work?
Four letters: cytosine (C), guanine (G), adenine (A), thymine (T)
Encode 20 amino acids
Combine to make 100K+ proteins
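The letters-to-amino-acids idea can be sketched in a few lines of Scala. This is a toy illustration only: the table below covers just four of the 64 codons, and the names are picked for the example, not taken from the talk.

```scala
// Toy sketch: DNA is read three letters at a time; each triplet (codon)
// maps to an amino acid. Only a handful of the 64 codons are shown here.
val codonTable: Map[String, String] = Map(
  "ATG" -> "Met", // also the "start" codon
  "TGG" -> "Trp",
  "GGA" -> "Gly",
  "TAA" -> "Stop"
)

// Walk a DNA string in steps of three and translate each known codon.
def translate(dna: String): Seq[String] =
  dna.grouped(3).toSeq.flatMap(codonTable.get)

translate("ATGTGGGGATAA") // Met, Trp, Gly, Stop
```

Four letters in triplets give 4^3 = 64 possible codons, which is why 64 codons redundantly encode only 20 amino acids.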

The Central Dogma
Conceptually we can think of this as a pipeline.

DNA as storage
Durable
Supports random access
Efficient sequential reads
Easily replicated
Contains error correction mechanisms
Maximally data local

Part 2: What we do with Spark
Analyze lots of genome sequences.
Apply similarity algorithms, find where they match.
Finally, assist the breeding program.
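The matching step can be sketched in miniature with a k-mer index: index one sequence by all of its substrings of length k, then look up where fragments occur. This is an invented single-machine toy, not the talk's actual algorithm; real genomic similarity tools must also tolerate errors and gaps.

```scala
// Build an index from every k-length substring (k-mer) of a reference
// sequence to the positions where it occurs.
def kmerIndex(seq: String, k: Int): Map[String, Seq[Int]] =
  seq.sliding(k).zipWithIndex.toSeq
    .groupBy(_._1)
    .mapValues(_.map(_._2))
    .toMap

val reference = "ACGTACGTGG"
val index = kmerIndex(reference, 4)

index("ACGT") // occurs at positions 0 and 4
```

At scale, an index like this would be built as a distributed pair collection (sequence fragments keyed by k-mer) rather than an in-memory Map.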

Input data is noisy
Contains errors and gaps.
Is fragmented.
All due to sequencing technology.

Our setup
Hadoop clusters on both private cloud and AWS
Textual files, using Parquet
MapR 5 Hadoop distro
Spark 1.4.1
SparkSQL and Hive (JDBC)
Instances: ~150 GB RAM, 40 cores
Provisioning: Ansible
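A setup like this is typically driven by a handful of Spark settings. The fragment below is a hedged sketch of what a spark-defaults.conf for Spark 1.4 on machines of this size might contain; all values are illustrative, not the presenter's actual configuration.

```
spark.master                          yarn-client
spark.executor.memory                 100g
spark.executor.cores                  32
spark.serializer                      org.apache.spark.serializer.KryoSerializer
spark.sql.parquet.compression.codec   snappy
```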

Our data
A dozen or so different crops, going for hundreds.
Each crop: potentially ~1K fully sequenced samples, ~100K markers.
Each sequence: 1 Gbp - 10 Gbp (giga base-pairs = characters) long.
Current: several terabytes, aiming at petabytes.

Working with Spark and Scala
Scala's type system is your friend
Thinking functional takes time - and can be overdone
Remember to add @tailrec when needed
Scala case classes - great
Nested structures: keep you DRY, but sluggish
Scala has its pitfalls - profile
"Spark as the ultimate Scala collection" - Martin Odersky
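The @tailrec and case-class tips can be illustrated with a small, self-contained sketch. Marker and countAllele are invented names for this example, not the project's actual model.

```scala
import scala.annotation.tailrec

// A case class keeps records lightweight, immutable and pattern-matchable.
case class Marker(chromosome: String, position: Long, allele: Char)

// Without @tailrec the compiler happily accepts a non-tail-recursive
// version that can blow the stack on long lists; the annotation makes
// the compiler *verify* the recursive call is in tail position.
@tailrec
def countAllele(markers: List[Marker], allele: Char, acc: Int = 0): Int =
  markers match {
    case Nil       => acc
    case m :: rest => countAllele(rest, allele, if (m.allele == allele) acc + 1 else acc)
  }

val ms = List(Marker("chr1", 100, 'A'), Marker("chr1", 200, 'G'), Marker("chr2", 50, 'A'))
countAllele(ms, 'A') // 2
```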

Integrations with Spark
Complex unmanaged framework - the usual 20/80 rule:
20% fun algorithmic stuff
80% integration/devops/tuning/black voodoo
Integration with Hive - doable but cumbersome
DataFrames API - very clean
Parquet in Spark 1.4 - seamless; Parquet with SparkSQL < 1.3 - rather sucks

Performance tuning with Spark
If RDD objects need a lot of RAM, memory gets tricky
Spark UI in 1.4.1 - very nice
PairRDD - you need to be your own query optimizer
repartition / coalesce - very useful, but gets tricky if data variability is high (a dynamic real-time optimizer would be great)

Testing, packaging and extending Spark
Testing: local mode is great, but means no real unit tests :-(
sbt-pack - a good alternative to sbt-assembly
Spark packages: spark-csv, spark-notebook and more
Speaking of open-source packages...

ADAM Project - Genomics using Spark
Fully open sourced from
Similarity algorithms
Population clustering
Predictive analysis using deep learning
And more


Thank you