Explore Spark for Metagenome assembly - NERSC


Transcript of Explore Spark for Metagenome assembly - NERSC

Page 1

Explore Spark for Metagenome assembly

Zhong Wang, Ph.D. Group Lead, Genome Analysis

Page 2

National Microbiome Initiative

Page 3

Environmental microbial communities are complex

[Figure: number of species per community, log scale from 1 to 10,000: human < cow < soil]

>90% of the species haven't been seen before

Page 4

Decode Metagenome

Metagenome → Short Reads (Shotgun Sequencing) → Assembled Genomes (Metagenome Assembly)

Billions of pieces, Terabytes in size: "Big Data"

Library of Books → Shredded Library → "Reconstructed" Library

Genome ~= Book; Metagenome ~= Library

Page 5

The Ideal Solution

•  Easy to develop
•  Robust
•  Scale to big data
•  Efficient

Page 6

2009: Special Hardware

Bottlenecks: Input/Output (IO), Memory

$1M; only scales up to ~100 Gb

Jeremy Brand @ JGI, FPGA @ Convey

Page 7

2010: MP/MPI on supercomputers

Problems:
•  Requires experienced software engineers
•  Six months of development time
•  One task fails, all tasks fail

Strengths: fast, scalable

Rob Egan @ JGI

MPI version: 412 Gb (4.5B reads) in 2.7 hours on 128x24 cores of a NERSC supercomputer

Page 8

2011: Hadoop/Map Reduce framework

•  Google MapReduce
   –  Data parallel programming model to process petabyte data
   –  Generally has a map and a reduce step
•  Apache Hadoop
   –  Distributed file system (HDFS) and job handling for scalability and robustness
   –  Data locality to bring compute to data, avoiding the network transfer bottleneck
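The map and reduce steps above can be sketched in plain Python. This toy word count is an illustration of the programming model, not the Hadoop API; `map_step`, `reduce_step`, and `mapreduce` are hypothetical names:

```python
from itertools import groupby

# Map step: emit (key, value) pairs from one input record.
def map_step(line):
    return [(word, 1) for word in line.split()]

# Reduce step: combine all values that share a key.
def reduce_step(key, values):
    return key, sum(values)

def mapreduce(records):
    # Shuffle: sort and group intermediate pairs by key, as the framework does.
    pairs = sorted(p for r in records for p in map_step(r))
    grouped = groupby(pairs, key=lambda p: p[0])
    return dict(reduce_step(k, [v for _, v in g]) for k, g in grouped)

counts = mapreduce(["spark hadoop spark", "hadoop"])
```

Hadoop distributes the map and reduce calls across machines and handles the shuffle over the network; the user only writes the two small functions.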

Page 9

Programmability: Java vs Pig

Example task: finding out the top 5 websites young people visit

Page 10

2013: BioPig

BioPig-Blaster

BioPig-Assembler

BioPig-Extender

BioPig: 61 lines of code. MPI extender: ~12,000 lines (vs 31 in BioPig)

Robust, Programmability, Scalability

Karan Bhatia, Henrik Nordberg, Kai Wang

BioPig

Page 11

Challenges in application

•  2-3 orders of magnitude slower than MPI
•  IO optimization needed, e.g., reduce data copying
•  Some problems do not easily fit into the map/reduce framework, e.g., graph-based algorithms
•  Runs on AWS, but costs $$$ if not optimized

Page 12

Optimizing BioPig

[Figure: time saving breakdown on 60 GB (minutes); walltime (minutes) vs. data size (1 GB - 100 GB) for Baseline, Tuned, and EMR runs]

3X Speedup

Lizhen Shi, Weikuan Yu @ FSU

Still very low efficiency!

Page 13

Addressing big data: Apache Spark

•  New scalable programming paradigm
•  Compatible with Hadoop-supported storage systems
•  Improves efficiency through:
   –  In-memory computing primitives
   –  General computation graphs
•  Improves usability through:
   –  Rich APIs in Java, Scala, Python
   –  Interactive shell
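The efficiency gain from in-memory primitives can be illustrated with a plain-Python analogy (not Spark's API): keeping an intermediate dataset in memory avoids recomputing it for every downstream action, which is what Spark's `cache()`/`persist()` do for an RDD. The counter and `expensive_transform` are made up for the sketch:

```python
calls = {"n": 0}

def expensive_transform(data):
    calls["n"] += 1                      # count how often the transform runs
    return [x * x for x in data]

# Without caching: each downstream action recomputes the transform
# (Hadoop-style, re-reading and re-deriving intermediate results each time).
a = sum(expensive_transform(range(1000)))
b = max(expensive_transform(range(1000)))

# With caching: compute once, keep the result in memory, reuse it (Spark-style).
cached = expensive_transform(range(1000))
c, d = sum(cached), max(cached)
```

The answers are identical either way; only the number of times the expensive work runs differs, which is where Spark's speedup over Hadoop comes from for iterative workloads.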

Page 14

Goal: Metagenome read clustering

•  Data characteristics:
   –  Total data size typically 100 Gb - 1 Tb
   –  >1 billion short pieces (reads, each 100-200 bp)
   –  >1,000 different species; some species are more similar than others
   –  Sequencing errors 1-2%
•  Proposed approach: divide-and-conquer
   –  Cluster reads from each genome (Clustering)
   –  Assemble each cluster in parallel (Assembly)
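The divide-and-conquer idea can be sketched in a few lines of Python. `cluster_of` and `assemble` are hypothetical stand-ins: the real pipeline clusters by sequence similarity and assembles with a de Bruijn graph assembler, neither of which is shown here:

```python
from collections import defaultdict

def cluster_of(read):
    return read[:4]                     # toy rule: bin reads by the first 4 bases

def assemble(reads):
    return max(reads, key=len)          # toy "assembly": keep the longest read

def divide_and_conquer(reads):
    bins = defaultdict(list)
    for r in reads:                     # Clustering step
        bins[cluster_of(r)].append(r)
    # Assembly step: every bin is independent, so this comprehension
    # could run in parallel, one task per bin.
    return {k: assemble(v) for k, v in bins.items()}

contigs = divide_and_conquer(["ACGTAA", "ACGTAACC", "TTTTGG"])
```

The key property is that once reads are binned, each bin fits in memory and assembles independently, turning one intractable assembly into many small parallel ones.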

Page 15

Read clustering with Spark: idea

•  Local information: overlap
•  Global information: covariance

[Diagram: reads from Samples 1-3 are clustered by Spark into per-genome bins; each bin is assembled independently with a de Bruijn graph assembler]
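The "global information: covariance" idea is that reads from the same genome rise and fall together in abundance across samples. A minimal sketch, with made-up abundance numbers, using Pearson correlation as the covariance-style signal:

```python
import math

def correlation(xs, ys):
    # Pearson correlation of two per-sample abundance profiles.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy abundances of reads across three samples: the first two profiles
# track each other (same genome), the third does not.
same_genome = correlation([10, 50, 5], [12, 48, 6])
other_genome = correlation([10, 50, 5], [40, 8, 45])
```

Reads with highly correlated profiles are candidates for the same cluster even when they never overlap locally, which is what makes the global signal complementary to overlap.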

Page 16

Read clustering with Spark: preprocess

Data Filtering

Page 17

Read clustering with Spark: core

Local Similarity + Global Similarity → Read Graph
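The local-similarity half of the read graph can be sketched as follows: connect two reads with an edge when they share enough k-mers. This is an illustrative toy (tiny `k`, brute-force all-pairs), not the Spark implementation, which would distribute the k-mer-to-read index:

```python
from itertools import combinations

def kmers(read, k=4):
    # All length-k substrings of a read.
    return {read[i:i + k] for i in range(len(read) - k + 1)}

def build_read_graph(reads, k=4, min_shared=2):
    # Edge between two reads when they share >= min_shared k-mers.
    edges = []
    for (i, a), (j, b) in combinations(enumerate(reads), 2):
        if len(kmers(a, k) & kmers(b, k)) >= min_shared:
            edges.append((i, j))
    return edges

edges = build_read_graph(["ACGTACGT", "GTACGTTT", "CCCCCCCC"])
```

At metagenome scale the all-pairs loop is replaced by grouping reads under each shared k-mer, so only reads that actually co-occur under a k-mer are ever compared.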

Page 18

Toy test datasets

•  Species:
   –  6 bacterial species
   –  Synthetic communities with random proportions of each
•  Data: single genome sequence data (synthetic & real reads)

Page 19

Cluster evaluation criteria: NMI

NMI: normalized mutual information

Mutual information: how pure the different clusters are

Entropy: penalizes having small clusters

http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
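NMI can be computed directly from cluster label counts. A minimal sketch using the arithmetic-mean normalization (one of several common variants; the slide does not specify which was used):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(truth, pred):
    # Normalized mutual information between two clusterings,
    # normalized by the mean of the two entropies.
    n = len(truth)
    joint = Counter(zip(truth, pred))
    pt, pp = Counter(truth), Counter(pred)
    mi = sum((c / n) * math.log((c / n) / ((pt[t] / n) * (pp[p] / n)))
             for (t, p), c in joint.items())
    denom = (entropy(truth) + entropy(pred)) / 2
    return mi / denom if denom else 1.0
```

NMI is 1 for a perfect clustering (even under relabeling, e.g. `nmi([0,0,1,1], [1,1,0,0])`) and 0 when the predicted clusters carry no information about the true species.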

Page 20

Testing Environments

•  Local
   –  Algorithm development
   –  32-core, 256 GB memory
•  HPC - Lawrencium
   –  Small-scale analysis
   –  CPU: Intel Xeon E5-2670
   –  16 cores per node, 64 GB memory per node
   –  InfiniBand FDR
•  NERSC - Cori
   –  Large-scale analysis
   –  CPU: Haswell (Cray)
   –  32 cores per node, 128 GB memory per node
   –  Cray Aries high-speed interconnect with Dragonfly topology

Page 21

Local Similarity

In an ideal situation (no errors, no repetitive sequences, sufficient sequencing coverage), read clustering with local similarity works perfectly.

Reads of the same color belong to the same genome.

In real-world situations:
•  Sequencing coverage is low: many small clusters may form from the same genome, leading to False Negatives
•  Different genomes share sequences: their reads can fall into the same cluster, leading to False Positives
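Clustering by local similarity amounts to taking connected components of the read graph: any chain of overlapping reads ends up in one cluster. A minimal union-find sketch (read IDs and edges are made up):

```python
def connected_components(n, edges):
    # Union-find over n reads; each component is one putative genome cluster.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)           # union the two components

    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(i)
    return sorted(comps.values())

clusters = connected_components(5, [(0, 1), (1, 2), (3, 4)])
```

This transitivity is exactly what causes the failure modes above: a single shared-sequence edge between two genomes merges both components (False Positives), and a coverage gap splits one genome into several components (False Negatives).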

Page 22

Some performance metrics

Needs 500-700X of memory: optimization is needed

Xiandong Meng @ JGI

Page 23

Global Similarity: input Parameters

No samples 1-1000

K-mer Length 20-50

K-Means Clusters 10-500

Eigen K-mers to sample 1-10,000

Eigen Reads to sample 100-60,000

Global Weight 0-150

Power Iteration Clusters 10-150

Power Iteration Steps 0-50
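The "power iteration" parameters refer to power iteration clustering (PIC), which embeds reads by repeatedly multiplying a normalized affinity matrix into a vector. A toy sketch with a made-up 4-read affinity matrix (the real pipeline would use a distributed implementation such as Spark MLlib's, not this loop):

```python
# Toy affinity matrix: reads 0-1 and reads 2-3 are strongly similar within
# pairs (1.0), weakly similar across pairs (0.1); self-affinity damps oscillation.
A = [[1.0, 1.0, 0.1, 0.0],
     [1.0, 1.0, 0.0, 0.1],
     [0.1, 0.0, 1.0, 1.0],
     [0.0, 0.1, 1.0, 1.0]]

def power_iteration(A, steps=5):
    n = len(A)
    W = [[a / sum(row) for a in row] for row in A]   # row-normalize to sums of 1
    v = [(i + 1) / n for i in range(n)]              # non-uniform starting vector
    for _ in range(steps):
        v = [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]
        m = max(v)
        v = [x / m for x in v]                       # rescale each step
    return v

embedding = power_iteration(A)
```

After a few steps the embedding values of same-cluster reads converge toward each other faster than the clusters merge, so a simple 1-D k-means on `embedding` recovers the two groups; the "steps" parameter controls how long to iterate before that structure washes out.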

Page 24

Exploring the parameter space

Jordan Hoffman @ Harvard
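Exploring a parameter space this size is typically an exhaustive or sampled grid sweep. A minimal sketch: the grid values are taken from the ranges on the previous slide, but `score` is a hypothetical stand-in (in the real pipeline it would run the clustering for one combination and return its NMI; the peak at (31, 40, 10) is invented):

```python
from itertools import product

def score(kmer_len, pic_clusters, pic_steps):
    # Stand-in objective with an arbitrary best point, for illustration only.
    return -abs(kmer_len - 31) - abs(pic_clusters - 40) - abs(pic_steps - 10)

grid = {
    "kmer_length": [20, 31, 50],
    "power_iteration_clusters": [10, 40, 150],
    "power_iteration_steps": [5, 10, 50],
}

# Exhaustive sweep over every combination, keeping the best-scoring one.
best = max(product(*grid.values()), key=lambda params: score(*params))
```

Because each combination is an independent clustering run, the sweep itself parallelizes trivially, one job per parameter tuple.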

Page 25

Overall impression of Spark

✓  Easy to develop
?  Robust
?  Scale to big data
?  Efficient

§  vs Hadoop/Pig
§  vs MPI

Page 26

Acknowledgements

BioPig Team: Henrik Nordberg, Kai Wang, Karan Bhatia @ Amazon, Lizhen Shi & Weikuan Yu @ FSU

Spark Team: Xiandong Meng, Jordan Hoffman

Lisa Gerhardt, Evan Racah, Shane Cannon @ NERSC

Gary Jung, Greg Kurtzer, Bernard Li, Yong Qin @ HPC