Explore Spark for Metagenome assembly - NERSC

Post on 22-May-2020



Explore Spark for Metagenome assembly

Zhong Wang, Ph.D. Group Lead, Genome Analysis

National Microbiome Initiative

Environmental microbial communities are complex

[Chart: number of species per community (log scale, 1 to 10,000) for Human, Cow, and Soil microbiomes.]

>90% of the species haven't been seen before

Decode Metagenome

Metagenome → Shotgun Sequencing → Short Reads → Metagenome Assembly → Assembled Genomes

Billions of pieces, terabytes in size: "Big Data"

Library of Books → Shredded Library → "Reconstructed" Library

Genome ~= Book; Metagenome ~= Library

The Ideal Solution

•  Easy to develop
•  Robust
•  Scale to big data
•  Efficient

2009: Special Hardware

Input/Output (IO), Memory

$1M; only scales up to ~100 Gb

Jeremy Brand @ JGI, FPGA @ Convey

2010: MP/MPI on supercomputers

Problems:
•  Requires experienced software engineers
•  Six months of development time
•  One task fails, all tasks fail

Pros: fast, scalable

Rob Egan @ JGI

MPI version: 412 Gb, 4.5B reads

2.7 hours on 128x24 cores, NERSC Supercomputer

2011: Hadoop/Map Reduce framework

•  Google MapReduce
   –  Data-parallel programming model to process petabyte-scale data
   –  Generally has a map and a reduce step
•  Apache Hadoop
   –  Distributed file system (HDFS) and job handling for scalability and robustness
   –  Data locality to bring compute to data, avoiding the network-transfer bottleneck
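To make the map/reduce pattern concrete, here is a minimal plain-Python sketch (not from the slides) of a k-mer counting job, the kind of data-parallel kernel these frameworks distribute; the function names and data are illustrative:

```python
from collections import defaultdict

def map_phase(reads, k=4):
    """Map step: emit (k-mer, 1) pairs from each read."""
    for read in reads:
        for i in range(len(read) - k + 1):
            yield read[i:i + k], 1

def reduce_phase(pairs):
    """Reduce step: sum the counts for each k-mer key."""
    counts = defaultdict(int)
    for kmer, n in pairs:
        counts[kmer] += n
    return dict(counts)

reads = ["ACGTACGT", "ACGTTTTT"]
counts = reduce_phase(map_phase(reads))
```

In Hadoop the map tasks run where the read data lives (data locality), and the framework shuffles the (k-mer, 1) pairs to reduce tasks by key.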

Programmability: Java vs Pig

Finding out the top 5 websites young people visit
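The benchmark task above is a filter/group/count/order/limit pipeline, which Pig expresses in a few lines versus pages of Java. A hedged plain-Python equivalent of the same logic, with hypothetical data:

```python
from collections import Counter

# Hypothetical visit log: (user_age, website) records.
visits = [
    (19, "siteA"), (22, "siteB"), (19, "siteA"),
    (45, "siteC"), (24, "siteB"), (17, "siteA"),
]

# FILTER (age < 25), GROUP BY site, COUNT, ORDER, LIMIT 5 --
# the same pipeline Pig Latin states declaratively.
young = [site for age, site in visits if age < 25]
top5 = Counter(young).most_common(5)
```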

2013: BioPig

BioPig-Blaster

BioPig-Assembler

BioPig-Extender

BioPig: 61 lines of code; MPI-extender: ~12,000 lines (vs 31 in BioPig)

Robust

Programmability

Scalability


Karan Bhatia, Henrik Nordberg, Kai Wang

BioPIG

Challenges in application

•  2-3 orders of magnitude slower than MPI
•  IO optimization, e.g., reduce data copying
•  Some problems do not easily fit into the map/reduce framework, e.g., graph-based algorithms
•  Runs on AWS, but costs $$$ if not optimized

Optimizing BioPig

[Chart: time saved (minutes) by individual optimizations on the 60 GB dataset.]

[Chart: wall time (minutes) vs. data size (1 GB to 100 GB) for Baseline, Tuned, and EMR runs; tuning gives a 3X speedup.]

Lizhen Shi, Weikuan Yu @ FSU

Still very low efficiency!

Addressing big data: Apache Spark

•  New scalable programming paradigm
•  Compatible with Hadoop-supported storage systems
•  Improves efficiency through:
   –  In-memory computing primitives
   –  General computation graphs
•  Improves usability through:
   –  Rich APIs in Java, Scala, Python
   –  Interactive shell

Goal: Metagenome read clustering

•  Data characteristics:
   –  Total data size typically 100 Gb – 1 Tb
   –  >1 billion short pieces (reads, each 100-200 bp)
   –  >1,000 different species; some species are more similar than others
   –  Sequencing errors 1-2%
•  Proposed approach: divide-and-conquer
   –  Cluster reads from each genome (Clustering)
   –  Assemble each cluster in parallel (Assembly)

•  Local information: overlap

•  Global information: covariance
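These two signals can be sketched in a few lines of Python; this is an illustrative reading of the slides, not the project's actual code. Local similarity here is shared-k-mer overlap between two reads; global similarity is the covariance of a read's coverage profile across samples (reads from the same genome should rise and fall together):

```python
def kmers(seq, k=4):
    """All k-length substrings of a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def local_similarity(r1, r2, k=4):
    """Local signal: Jaccard overlap of the two reads' k-mer sets."""
    a, b = kmers(r1, k), kmers(r2, k)
    return len(a & b) / len(a | b)

def global_similarity(cov1, cov2):
    """Global signal: covariance of per-sample coverage profiles.
    Positive covariance suggests the reads come from the same genome."""
    n = len(cov1)
    m1, m2 = sum(cov1) / n, sum(cov2) / n
    return sum((x - m1) * (y - m2) for x, y in zip(cov1, cov2)) / n
```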

Read clustering with Spark: idea

[Diagram: reads from Sample 1-3 are partitioned by Spark into per-genome clusters (e.g., 1, 2, ... 10), each assembled in parallel by its own de Bruijn assembler.]

Read clustering with Spark: preprocess

Data Filtering

Read clustering with Spark: core

Local Similarity + Global Similarity → Read Graph
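Once pairwise similarities are available, the clustering step amounts to building a graph with reads as nodes and similarity edges, then splitting it into connected components. A minimal sketch of that core step (names and data are hypothetical):

```python
def cluster_reads(edges, reads):
    """Build a read graph from similarity edges and split it into
    connected components; each component is one putative genome bin."""
    adj = {r: set() for r in reads}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, clusters = set(), []
    for r in reads:
        if r in seen:
            continue
        stack, comp = [r], set()
        while stack:            # iterative DFS over the component
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.add(node)
            stack.extend(adj[node] - seen)
        clusters.append(comp)
    return clusters

clusters = cluster_reads([("r1", "r2"), ("r3", "r4")],
                         ["r1", "r2", "r3", "r4"])
```

At billion-read scale this is exactly the kind of graph computation the slides note does not fit map/reduce well, and where Spark's general computation graphs help.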

Toy test datasets

•  Species:
   –  6 bacterial species
   –  Synthetic communities with random proportions of each
•  Data: single-genome sequence data (synthetic & real reads)
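A synthetic community of this kind can be simulated in a few lines; this is a hedged sketch of how such test data might be generated (all names, lengths, and the 1.5% error rate are illustrative, not the authors' pipeline):

```python
import random

def simulate_reads(genomes, n_reads, read_len=100, err=0.015, seed=0):
    """Sample reads from a synthetic community: pick a genome with
    random proportions, cut out a read, and add ~1.5% base errors."""
    rng = random.Random(seed)
    props = [rng.random() for _ in genomes]
    total = sum(props)
    weights = [p / total for p in props]   # random community proportions
    reads = []
    for _ in range(n_reads):
        g = rng.choices(genomes, weights=weights)[0]
        start = rng.randrange(len(g) - read_len + 1)
        read = list(g[start:start + read_len])
        for i in range(read_len):          # sequencing errors
            if rng.random() < err:
                read[i] = rng.choice("ACGT")
        reads.append("".join(read))
    return reads

# Toy community: 6 random "genomes" of 300 bp each.
genomes = ["".join(random.Random(g).choice("ACGT") for _ in range(300))
           for g in range(6)]
reads = simulate_reads(genomes, 50)
```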

Cluster evaluation criteria: NMI

NMI: normalized mutual information

Mutual Information: how pure the different clusters are

Entropy: penalizes having small clusters

http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
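NMI can be computed directly from the joint distribution of cluster assignments and true labels. A self-contained sketch, using the common arithmetic-mean normalization (the slides do not specify which normalization was used, so this is an assumption):

```python
from collections import Counter
from math import log

def nmi(labels_pred, labels_true):
    """Normalized mutual information between a clustering and
    ground-truth labels (1.0 = perfect agreement, 0.0 = independent)."""
    n = len(labels_true)
    pc = Counter(labels_pred)                 # cluster sizes
    tc = Counter(labels_true)                 # true-class sizes
    joint = Counter(zip(labels_pred, labels_true))
    # Mutual information between the two labelings.
    mi = sum((nij / n) * log(n * nij / (pc[c] * tc[t]))
             for (c, t), nij in joint.items())
    # Entropies of each labeling; their mean normalizes MI to [0, 1].
    h_pred = -sum((m / n) * log(m / n) for m in pc.values())
    h_true = -sum((m / n) * log(m / n) for m in tc.values())
    denom = (h_pred + h_true) / 2
    return mi / denom if denom else 1.0
```

A clustering that matches the true genome-of-origin labels up to renaming scores 1.0; random assignments score near 0.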

Testing Environments

•  Local
   –  Algorithm development
   –  32 cores
   –  256 GB memory
•  HPC-Lawrencium
   –  Small-scale analysis
   –  CPU: Intel Xeon E5-2670
   –  16 cores per node
   –  64 GB memory per node
   –  InfiniBand FDR
•  NERSC-Cori
   –  Large-scale analysis
   –  CPU: Cray Haswell
   –  32 cores per node
   –  128 GB memory per node
   –  Cray Aries high-speed interconnect with Dragonfly topology

Local Similarity

In the ideal situation (no errors, no repetitive sequences, sufficient sequencing coverage), read clustering with local similarity works perfectly.

Reads of the same color belong to the same genome.

In real-world situations: when sequencing coverage is low, many small clusters may form from the same genome, leading to false negatives; when different genomes share sequences, reads can fall into the same cluster, leading to false positives.

Some performance metrics

Needs 500-700X of memory: optimization is needed

Xiandong Meng @ JGI

Global Similarity: input parameters

Parameter                    Range
No. of samples               1 – 1,000
K-mer length                 20 – 50
K-means clusters             10 – 500
Eigen k-mers to sample       1 – 10,000
Eigen reads to sample        100 – 60,000
Global weight                0 – 150
Power iteration clusters     10 – 150
Power iteration steps        0 – 50

Exploring the parameter space

Jordan Hoffman @ Harvard

Overall impression of Spark

✓  Easy to develop
?  Robust
?  Scale to big data
?  Efficient

§  vs Hadoop/Pig
§  vs MPI

Acknowledgements

BioPig Team: Henrik Nordberg, Kai Wang, Karan Bhatia @ Amazon, Lizhen Shi, Weikuan Yu @ FSU

Spark Team: Xiandong Meng, Jordan Hoffman

Lisa Gerhardt, Evan Racah, Shane Cannon @ NERSC; Gary Jung, Greg Kurtzer, Bernard Li, Yong Qin @ HPC