Interactive Scientific Image Analysis using Spark

SUMMIT EAST: Interactive Scientific Image Analysis and Analytics using Spark. Kevin Mader, Spark East, NYC, 19 March 2015

Transcript of Interactive Scientific Image Analysis using Spark

Page 1: Interactive Scientific Image Analysis using Spark

Interactive Scientific Image Analysis and Analytics using Spark

Kevin Mader

Spark East, NYC, 19 March 2015

Page 2: Interactive Scientific Image Analysis using Spark

Outline

Background: Our Technique (why we have big data)

X-Ray Tomographic Microscopy

Imaging in 2015

The Problem(s)

The Tools

Spark Imaging Layer

3D Imaging

Hyperspectral Imaging

Interactive Analysis / Streaming

The Science

Genome Scale Studies

Large Datasets

Outlook / Developments

Page 3: Interactive Scientific Image Analysis using Spark

Synchrotron-based X-Ray Tomographic Microscopy

The only technique which can do all

peer deep into large samples

achieve < 1 μm isotropic spatial resolution

with 1.8 mm field of view

achieve > 10 Hz temporal resolution

8 GB/s of images

[1] Mokso et al., J. Phys. D, 46(49), 2013

Courtesy of M. Pistone at U. Bristol

Page 4: Interactive Scientific Image Analysis using Spark

Image Science in 2015: More and faster

X-Ray

Swiss Light Source (SRXTM) images at >1000 fps (8 GB/s), diffraction patterns (cSAXS) at 30 GB/s

Nanoscopium (Soleil), 10 TB/day, 10-500 GB file sizes, very heterogeneous data

Optical

Light-sheet microscopy (see talk of Jeremy Freeman) produces images at 500 MB/s

High-speed confocal images at >200 fps (78 Mb/s)

Geospatial

New satellite projects (Skybox, etc) will measure hundreds of terabytes to petabytes of images a year

Personal

GoPro 4 Black - 60 MB/s (3840 x 2160 x 30 fps) for $600

fps1000 - 400 MB/s (640 x 480 x 840 fps) for $400

Page 5: Interactive Scientific Image Analysis using Spark

How much is a TB, really?

If you looked at one 1000 x 1000 sized image every second, it would take you 139 hours to browse through a terabyte of data.

Year    Time to 1 TB    Manpower to keep up    Salary Costs / Month
2000    4096 min        2 people               25 kCHF
2008    1092 min        8 people               95 kCHF
2014    32 min          260 people             3255 kCHF
2016    2 min           3906 people            48828 kCHF
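The 139 hours and the table rows above can be reproduced with a few lines of arithmetic. This is a sketch under two assumptions that are not stated on the slide: each pixel takes 2 bytes (16-bit data), and one person costs 12.5 kCHF per month, which is the value implied by dividing the salary column by the manpower column. The object and function names are made up for illustration.

```scala
object TerabyteBrowsing {
  // Assumption (not on the slide): 16-bit pixels, i.e. 2 bytes each.
  val bytesPerImage: Double = 1000.0 * 1000.0 * 2.0
  val terabyte: Double = 1e12

  // One image per second, so the seconds needed equal the number of images.
  val browseSeconds: Double = terabyte / bytesPerImage
  val browseHours: Double = browseSeconds / 3600.0
  val browseMinutes: Double = browseSeconds / 60.0

  // People needed to look at the data as fast as it is acquired.
  def peopleToKeepUp(minutesToAcquireOneTB: Double): Double =
    browseMinutes / minutesToAcquireOneTB

  // Assumption: 12.5 kCHF per person per month, inferred from the table.
  def salaryKChfPerMonth(minutesToAcquireOneTB: Double): Double =
    peopleToKeepUp(minutesToAcquireOneTB) * 12.5

  def main(args: Array[String]): Unit = {
    println(f"Hours to browse 1 TB: $browseHours%.1f")
    for (minutes <- Seq(4096.0, 1092.0, 32.0)) {
      val people = peopleToKeepUp(minutes)
      val cost = salaryKChfPerMonth(minutes)
      println(f"$minutes%.0f min per TB: $people%.1f people, $cost%.0f kCHF/month")
    }
  }
}
```

With these assumptions the 2000, 2008, and 2014 rows come out at the rounded values shown in the table; the 2016 row appears to use an unrounded acquisition time slightly above 2 minutes.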

Page 6: Interactive Scientific Image Analysis using Spark

Computing has changed: Parallel

Moore's Law

$\text{Transistors} \propto 2^{T/(18\ \text{months})}$

Based on data from https://gist.github.com/humberto-ortiz/de4b3a621602b78bf90d

There are now many more transistors inside a single computer but the processing speed hasn't increased. How can this be?

Multiple Core

Many machines have multiple cores for each processor which can perform tasks independently

Multiple CPUs

More than one chip is commonly present

New modalities

GPUs provide many cores which operate at slow speed

Parallel Code is important

Page 7: Interactive Scientific Image Analysis using Spark

Cloud Computing Costs

The figure shows the range of cloud costs (determined by peak usage) compared to a local workstation with utilization shown as the average number of hours the computer is used each week.

The figure shows the cost of a cloud based solution as a percentage of the cost of buying a single machine. The values below 1 show the percentage as a number. The panels distinguish the average time to replacement for the machines in months

Page 8: Interactive Scientific Image Analysis using Spark

The Problem

There is a flood of new data

What took an entire PhD 3-4 years ago can now be measured in a weekend, or even several seconds. Analysis tools have not kept up, are difficult to customize, and usually highly specific.

Optimized Data-Structures do not fit

Data-structures that were fast and efficient for computers with 640 kB of memory do not make sense anymore

Single-core computing is too slow

CPUs are not getting that much faster but there are a lot more of them. Iterating through a huge array takes almost as long on 2014 hardware as on 2006 hardware

Page 9: Interactive Scientific Image Analysis using Spark

Exploratory Image Processing Priorities

Correctness

The most important job for any piece of analysis is to be correct.

A powerful testing framework is essential

Avoid repetition of code which leads to inconsistencies

Use compilers to find mistakes rather than users

Easily understood, changed, and used

Almost all image processing tasks require a number of people to evaluate and implement them and are almost always moving targets

Flexible, modular structure that enables replacing specific pieces

Fast

The last of the major priorities is speed, which covers scalability, raw performance, and development time.

Long waits for processing discourage exploration

Manual access to data on separate disks is a huge speed barrier

Real-time image processing requires millisecond latencies

Implementing new ideas can be done quickly

Page 10: Interactive Scientific Image Analysis using Spark

The Framework First

Rather than building an analysis as quickly as possible and then trying to hack it to scale up to large datasets

choose the framework first

then start making the necessary tools.

Google, Amazon, Yahoo, and many other companies have made huge in-roads into these problems

The real need is a fast, flexible framework for robustly, scalably performing complicated analyses, a sort of Excel for big imaging data.

Apache Spark and Hadoop 2

The two frameworks provide a free out of the box solution for

scaling to >10000 computers

storing and processing exabytes of data

fault tolerance

2/3rds of computers can crash and a request still accurately finishes

hardware and software platform independence (Mac, Windows, Linux)

Page 11: Interactive Scientific Image Analysis using Spark

Spark -> Microscopy?

These frameworks are really cool and Spark has a big vocabulary, but flatMap, filter, aggregate, join, groupBy, and fold still do not sound like anything I want to do to an image.

I want to

filter out noise, segment, choose regions of interest

contour, component label

measure, count, and analyze

Spark Image Layer

Developed at 4Quant, ETH Zurich, and Paul Scherrer Institut

The Spark Image Layer is a Domain Specific Language for Microscopy for Spark.

It converts common imaging tasks into coarse-grained Spark operations

Page 12: Interactive Scientific Image Analysis using Spark

Spark Image Layer

We have developed a number of commands for SIL handling standard image processing tasks

Fully extensible with

Page 13: Interactive Scientific Image Analysis using Spark

Use case: Hyperspectral Imaging

Hyperspectral imaging is a rapidly growing area with the potential for massive datasets and a severe deficit of usable tools.

The scale of the data is large and standard image processing tools are ill-suited for handling them, although the ideas used in image processing are equally applicable to hyperspectral data (filtering, thresholding, segmentation, …) and distributed, parallel approaches make even more sense on such massive datasets

Page 14: Interactive Scientific Image Analysis using Spark

Flexibility through Types

Developing in Scala brings additional flexibility through types [1]. With microscopy, the standard formats are 2-, 3- and even 4- or more dimensional arrays or matrices which can be iterated through quickly using CPU and GPU code. While this is still possible in Scala, there is a great deal more flexibility for data types, allowing anything to be stored as an image and then processed as long as the basic functions make sense.

[1] Fighting Bit Rot with Types (Experience Report: Scala Collections), M. Odersky, FSTTCS 2009, December 2009

What is an image?

A collection of positions and values, maybe more (not an array of double). Arrays are efficient for storing in computer memory, but often a poor way of expressing scientific ideas and analyses.

Filter Noise?

combine information from nearby pixels

Find objects

determine groups of pixels which are very similar to desired result

Page 15: Interactive Scientific Image Analysis using Spark

Making Coding Simpler with Types

    trait BasicMathSupport[T] extends Serializable {
      def plus(a: T, b: T): T
      def times(a: T, b: T): T
      def scale(a: T, b: Double): T
      def negate(a: T): T = scale(a, -1)
      def invert(a: T): T
      def abs(a: T): T
      def minus(a: T, b: T): T = plus(a, negate(b))
      def divide(a: T, b: T): T = times(a, invert(b))
      def compare(a: T, b: T): Int
    }

Page 16: Interactive Scientific Image Analysis using Spark

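Continuing with Types

Simple filter implementation

    // Image, D3int, and runFilter are used here as they appear on the slide
    def SimpleFilter[T](inImage: Image[T])(implicit wst: BasicMathSupport[T]) = {
      val width: Double = 1
      // Gaussian weight based on the distance from the kernel center
      val kernel = (pos: D3int, value: T) =>
        wst.scale(value, math.exp(-math.pow(pos.mag / width, 2)))
      // combine two contributions by averaging them
      val kernelReduce = (ptA: T, ptB: T) => wst.scale(wst.plus(ptA, ptB), 0.5)
      runFilter(inImage, kernel, kernelReduce)
    }

Spectra as well-supported types

    implicit val SpectraBMS = new BasicMathSupport[Array[Double]] {
      // element-wise sum of two spectra
      def plus(a: Array[Double], b: Array[Double]) = a.zip(b).map { case (x, y) => x + y }
      ...
      def scale(a: Array[Double], b: Double) = a.map(_ * b)
    }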
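To illustrate how such a type class lets one function work on both grayscale values and spectra, here is a minimal self-contained sketch. It uses a reduced two-method version of the trait so that it compiles on its own, and the `mean` helper and the object name are hypothetical; nothing here is taken from the Spark Image Layer source.

```scala
object TypesSketch {
  // Reduced version of the trait from the slides: only the operations used here.
  trait BasicMathSupport[T] extends Serializable {
    def plus(a: T, b: T): T
    def scale(a: T, b: Double): T
  }

  // Instance for plain grayscale values.
  implicit val doubleBMS: BasicMathSupport[Double] = new BasicMathSupport[Double] {
    def plus(a: Double, b: Double): Double = a + b
    def scale(a: Double, b: Double): Double = a * b
  }

  // Instance for spectra stored as arrays, one entry per channel.
  implicit val spectraBMS: BasicMathSupport[Array[Double]] =
    new BasicMathSupport[Array[Double]] {
      def plus(a: Array[Double], b: Array[Double]): Array[Double] =
        a.zip(b).map { case (x, y) => x + y }
      def scale(a: Array[Double], b: Double): Array[Double] = a.map(_ * b)
    }

  // Hypothetical helper: average of a non-empty sequence of pixel values
  // of any type that has a BasicMathSupport instance.
  def mean[T](values: Seq[T])(implicit bms: BasicMathSupport[T]): T =
    bms.scale(values.reduce((a, b) => bms.plus(a, b)), 1.0 / values.size)
}
```

The same `mean` call then works for `Seq[Double]` and for `Seq[Array[Double]]`, which is the kind of flexibility the slides describe.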

Page 17: Interactive Scientific Image Analysis using Spark

Interactive Analysis

Combining many different components together inside of the Spark Shell, IPython or Zeppelin makes it easier to assemble workflows

Page 18: Interactive Scientific Image Analysis using Spark

Scientific Cases: Genome-scale Imaging

We want to understand the relationship between genetic background and bone structure

With existing tools, analysis is possible and a number of publications have been made, even ones that show differences between strains of mice

But

n < 12

time-consuming (years between measurement and publication)

not flexible or reproducible

not cloud-based

Page 19: Interactive Scientific Image Analysis using Spark

Genome-Scale Imaging

Genetic studies require hundreds to thousands of samples; in this case the difference between 717 and 1200 samples is the difference between finding the links and finding nothing.

2008 approach - 120 years

Hand Identification -> 30 s / object

30-40k objects per sample

One Sample in 6.25 weeks

2014 approach - 1.5 years

ImageJ macro for segmentation (2-4 hours / sample)

Python script for shape analysis (3 hours / sample)

Paraview macro for network and connectivity (2 hours / sample)

Python script to pool results (3-4 hours)

MySQL Database storing results (5 minutes / query)

Page 20: Interactive Scientific Image Analysis using Spark

Genetic Studies using Spark Image Layer

Analysis could be completed in several months (instead of 120 years) and could now be completed in days in the cloud

Data can be freely explored and analyzed

    val bones = sc.loadImages("work/f2_bones/*/bone.tif")

Segment hard and soft tissues

    val hardTissue = bones.threshold(OTSU)
    val softTissue = hardTissue.invert

Label cells

    val cells = hardTissue.componentLabel.
      filter(c => c.size > 100 && c.size < 1000)

Export results

    cells.shapeAnalysis.WriteOutput("lacuna.csv")
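To make the threshold, component-label, and size-filter steps concrete, here is a small sketch on a single 2D image held in memory as a map from positions to values. It is not the Spark Image Layer API: the object and function names are hypothetical, a fixed cutoff replaces Otsu's method, and a simple 4-connected flood fill stands in for the distributed labeling.

```scala
import scala.annotation.tailrec

object LabelSketch {
  type Pos = (Int, Int)

  // Keep only the positions whose value is above a fixed cutoff.
  // (The slide picks the cutoff with Otsu's method; a fixed value keeps this short.)
  def threshold(image: Map[Pos, Double], cutoff: Double): Set[Pos] =
    image.filter { case (_, value) => value > cutoff }.keySet

  // 4-connected neighbours of a pixel.
  private def neighbours(p: Pos): Seq[Pos] =
    Seq((p._1 + 1, p._2), (p._1 - 1, p._2), (p._1, p._2 + 1), (p._1, p._2 - 1))

  // Group foreground pixels into connected components with a flood fill.
  def componentLabel(foreground: Set[Pos]): List[Set[Pos]] = {
    @tailrec
    def grow(frontier: Set[Pos], component: Set[Pos]): Set[Pos] =
      if (frontier.isEmpty) component
      else {
        val next = frontier.flatMap(p => neighbours(p)).filter(foreground).diff(component)
        grow(next, component ++ next)
      }

    @tailrec
    def loop(remaining: Set[Pos], found: List[Set[Pos]]): List[Set[Pos]] =
      if (remaining.isEmpty) found.reverse
      else {
        val seed = remaining.head
        val component = grow(Set(seed), Set(seed))
        loop(remaining.diff(component), component :: found)
      }

    loop(foreground, Nil)
  }
}
```

A size filter such as `componentLabel(fg).filter(c => c.size > 100 && c.size < 1000)` then mirrors the filter on the slide.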

Page 21: Interactive Scientific Image Analysis using Spark

Parallel Tools for Image and Quantitative Analysis

    val cells = sqlContext.csvFile("work/f2_bones/*/cells.csv")
    val avgVol = sqlContext.sql("select SAMPLE,AVG(VOLUME) FROM cells GROUP BY SAMPLE")

Collaborators / Competitors can verify results and extend on analyses

Combine Images with Results

    avgVol.filter(_._2 > 1000).map(sampleToPath).joinByKey(bones)

See immediately in datasets of terabytes which image had the largest cells

New hypotheses and analyses can be done in seconds / minutes

Task                   Single Core Time    Spark Time (40 cores)
Load and Preprocess    360 minutes         10 minutes
Single Column Average  4.6 s               400 ms
1 K-means Iteration    2 minutes           1 s
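For readers less familiar with SQL, the grouped average and the subsequent filter can be written with plain Scala collections. This is only a sketch of the logic: the `Cell` row type and the function names are hypothetical, and no Spark context is involved.

```scala
object CellStatsSketch {
  // Hypothetical row type standing in for one line of cells.csv.
  final case class Cell(sample: String, volume: Double)

  // Plain-collections equivalent of
  // "select SAMPLE, AVG(VOLUME) FROM cells GROUP BY SAMPLE".
  def averageVolume(cells: Seq[Cell]): Map[String, Double] =
    cells.groupBy(_.sample).map { case (sample, group) =>
      sample -> (group.map(_.volume).sum / group.size)
    }

  // Samples whose average cell volume exceeds the cutoff,
  // as in avgVol.filter(_._2 > 1000).
  def largeCellSamples(cells: Seq[Cell], cutoff: Double = 1000.0): Set[String] =
    averageVolume(cells).filter(_._2 > cutoff).keySet
}
```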

Page 22: Interactive Scientific Image Analysis using Spark

Science Problems: Full Brain Imaging

Collaboration with A. Astolfo and A. Patera

Measure a full mouse brain (1 cm³) with cellular resolution (1 μm)

10 x 10 x 10 scans at 2560 x 2560 x 2160: 14 TVoxels

0.000004% of the entire dataset

14 TVoxels = 56 TB

Each scan needs to be registered and aligned together

There are no computers with 56 TB of memory

Even multithreaded approaches are not feasible and require many logistics

Analysis of the stitched data is also of interest (segmentation, vessel analysis, distribution and network connectivity)
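The voxel count and the 56 TB figure above follow from straightforward multiplication. The sketch below assumes 4 bytes per voxel (for example 32-bit floats); that value is not stated on the slide but is what makes 14 TVoxels correspond to roughly 56 TB. The object name is made up.

```scala
object BrainScanSize {
  val scans: Long = 10L * 10L * 10L
  val voxelsPerScan: Long = 2560L * 2560L * 2160L
  val totalVoxels: Long = scans * voxelsPerScan

  // Assumption (not stated on the slide): 4 bytes per voxel, e.g. 32-bit floats.
  val bytesPerVoxel: Long = 4L
  val totalBytes: Long = totalVoxels * bytesPerVoxel

  val teraVoxels: Double = totalVoxels / 1e12
  val teraBytes: Double = totalBytes / 1e12
}
```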

Page 23: Interactive Scientific Image Analysis using Spark

Science Problems: Big Stitching

    Images: RDD[((x, y, z), Img[Double])] = [(x, Img), …]

    dispField = Images.
      cartesian(Images).map {
        case ((xA, ImA), (xB, ImB)) =>
          xcorr(ImA, ImB, in = xB - xA)
      }

Page 24: Interactive Scientific Image Analysis using Spark

From Matching to Stitching

From the updated information provided by the cross correlations and by applying appropriate smoothing criteria (if necessary).

The stitching itself, rather than rewriting the original data, can be done in a lazy fashion as certain regions of the image are read.

This also ensures the original data is left unaltered and all analysis is reversible.

    def getView(tPos, tSize) =
      stImgs.
        filter(x => abs(x - tPos) < img.size).
        map { case (x, img) =>
          val oImg = new Image(tSize)
          oImg.copy(img, x, tPos)
        }.addImages(AVG)

Page 25: Interactive Scientific Image Analysis using Spark

Viewing Regions

    getView(Pos(26.5, 13), Size(2, 2))

Page 26: Interactive Scientific Image Analysis using Spark

Real-time with Spark Streaming: Webcam

In the biological imaging community, the open source tools of ImageJ2 and Fiji are widely accepted and have a large number of readily available plugins and tools.

We can integrate the functionality directly into Spark and perform operations on much larger datasets than a single machine could have in memory. Additionally these analyses can be performed on streaming data.

Page 27: Interactive Scientific Image Analysis using Spark

Streaming Analysis

Real-time Webcam Processing

    val wr = new WebcamReceiver()
    val ssc = sc.toStreaming(strTime)
    val imgList = ssc.receiverStream(wr)

Filter images

    val filtImgs = allImgs.mapValues(_.run("Median...", "radius=3"))

Create a background image

    val totImgs = inImages.count()
    val bgImage = inImages.reduce(_ add _).multiply(1.0 / totImgs)

Page 28: Interactive Scientific Image Analysis using Spark

Identify Outliers in Streams

Remove the background image and find the mean value

    val eventImages = filtImgs.
      transform { inImages =>
        val corImage = inImages.map {
          case (inTime, inImage) =>
            val corImage = inImage.subtract(bgImage)
            (corImage.getImageStatistics().mean, (inTime, corImage))
        }
        corImage
      }

Show the outliers

    eventImages.filter(iv => Math.abs(iv._1) > 20).
      foreachRDD(showResultsStr("outlier", _))
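The background subtraction and outlier test can be tried without a webcam or a streaming context. The following sketch assumes each frame is a flattened array of pixel intensities and uses ordinary Scala collections; the `Frame` alias, the function names, and the default cutoff of 20 (taken from the filter on the slide) are illustrative choices, not part of the original code.

```scala
object OutlierSketch {
  type Frame = Array[Double] // pixel intensities of one image, flattened

  // Pixel-wise mean of all frames, mirroring reduce(_ add _).multiply(1.0 / n).
  def background(frames: Seq[Frame]): Frame = {
    val sum = frames.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    sum.map(_ / frames.size)
  }

  // Mean value of a frame after subtracting the background.
  def correctedMean(frame: Frame, bg: Frame): Double = {
    val corrected = frame.zip(bg).map { case (x, y) => x - y }
    corrected.sum / corrected.length
  }

  // Timestamps of frames whose corrected mean exceeds the cutoff in magnitude.
  def outliers(frames: Seq[(Long, Frame)], bg: Frame, cutoff: Double = 20.0): Seq[Long] =
    frames.collect {
      case (time, frame) if math.abs(correctedMean(frame, bg)) > cutoff => time
    }
}
```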

Page 29: Interactive Scientific Image Analysis using Spark

Streaming Demo with Webcam

Page 30: Interactive Scientific Image Analysis using Spark

As a scientist (not a data-scientist)

Apache Spark is a brilliant platform, and with GraphX, MLLib, and other packages there are unlimited possibilities

Scala can be a beautiful but not easy language

Python is an easier language

Both suffer from

Non-obvious workflows

Scripts depending on scripts depending on scripts (can be very fragile)

Although all analyses can be expressed as a workflow, this is often difficult to see from the code

Non-technical persons have little ability to understand or make minor adjustments to analysis

Parameters require recompiling to change

or GUIs need to be placed on top

Page 31: Interactive Scientific Image Analysis using Spark

A basic image filtering operation

Thanks to Spark, it is cached, in memory, approximate, cloud-ready

Thanks to Map-Reduce it is fault-tolerant, parallel, distributed

Thanks to Java, it is hardware agnostic

But it is also not really so readable

    def spread_voxels(pvec: ((Int, Int), Double), windSize: Int = 1) = {
      val wind = (-windSize to windSize)
      val pos = pvec._1
      val scalevalue = pvec._2 / (wind.length * wind.length)
      for (x <- wind; y <- wind)
        yield ((pos._1 + x, pos._2 + y), scalevalue)
    }

    val filtImg = roiImg.
      flatMap(cvec => spread_voxels(cvec)).
      filter(roiFun).reduceByKey(_ + _)
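The same filter can be run on an ordinary in-memory collection, which may help when reading the Spark version. In this sketch `groupBy` followed by a sum stands in for Spark's `reduceByKey(_ + _)`; the object name, the `Pixel` alias, and the `inRoi` predicate on positions are hypothetical.

```scala
object BoxFilterSketch {
  type Pixel = ((Int, Int), Double)

  // Same idea as spread_voxels on the slide: every pixel hands an equal share
  // of its value to each position in the surrounding window.
  def spreadVoxels(pvec: Pixel, windSize: Int = 1): Seq[Pixel] = {
    val wind = -windSize to windSize
    val ((px, py), value) = pvec
    val share = value / (wind.length * wind.length)
    for (x <- wind; y <- wind) yield ((px + x, py + y), share)
  }

  // groupBy plus a sum stands in for Spark's reduceByKey(_ + _).
  def boxFilter(image: Seq[Pixel], inRoi: ((Int, Int)) => Boolean): Map[(Int, Int), Double] =
    image
      .flatMap(p => spreadVoxels(p))
      .filter { case (pos, _) => inRoi(pos) }
      .groupBy(_._1)
      .map { case (pos, shares) => pos -> shares.map(_._2).sum }
}
```

A single bright pixel is spread evenly over its 3 x 3 neighbourhood, which is the box-filter behaviour the slide's code implements.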

Page 32: Interactive Scientific Image Analysis using Spark

Little blocks for big data

Here we use a KNIME-based workflow and our Spark Imaging Layer extensions to create a workflow without any Scala or programming knowledge and with an easily visible flow from one block to the next without any performance overhead of using other tools.

Page 33: Interactive Scientific Image Analysis using Spark

Reality Check

Spark is not performant → dedicated, optimized CPU and GPU codes will perform slightly to much, much better when evaluated by pixels per second per processing power unit

these codes will be wildly outperformed by dedicated hardware / FPGA solutions

Serialization overhead and network congestion are not negligible for large datasets

→ But

Scala / Python in Spark is substantially easier to write and test

Highly optimized codes are very inflexible

Human time is 400x more expensive than AWS time

Mistakes due to poor testing can be fatal

Spark scales smoothly to enormous datasets

GPUs rarely have more than a few gigabytes

Writing code that pages to disk is painful

Spark is hardware agnostic (no drivers or vendor lock-in)

Page 34: Interactive Scientific Image Analysis using Spark

We have a cool tool, but what does this mean for me?

A spinoff - 4Quant: From images to insight

Cloud Image Processing

Use our distributed version of ImageJ in the cloud to analyze thousands of remote datasets using your own, ours, or community provided processing routines

Custom Analysis Solutions

Custom-tailored software to solve your problems

One Stop Shop

Measurement, analysis, and statistical analysis

Education / Training

Consulting

Advice on imaging techniques, analysis possibilities

Development of new analysis tools and workflows

Education

Workshops on Image Analysis

Courses / Training

Quantitative Big Imaging

Page 35: Interactive Scientific Image Analysis using Spark

Acknowledgements

AIT at PSI and Scientific Computer at ETH

TOMCAT Group

We are interested in partnerships and collaborations

Learn more at

4Quant: From Images to Statistics -

X-Ray Imaging Group at ETH Zurich -

http://www.4quant.com

http://bit.ly/1gD8wKb

Quantitative Big Imaging Course at ETH Zurich

Page 36: Interactive Scientific Image Analysis using Spark

Feature Vectors

A pairing between spatial information (position) and some other kind of information (value).

We are used to seeing images in a grid format where the position indicates the row and column in the grid and the intensity (absorption, reflection, tip deflection, etc) is shown as a different color

$\vec{x} \rightarrow \vec{f}$

The alternative form for this image is as a list of positions and a corresponding value

x    y    Intensity
1    1    12
2    1    68
3    1    81
4    1    89
5    1    87
1    2    40

This representation can be called the feature vector and in this case it only has Intensity

$\vec{I} = (\vec{x}, \vec{f})$
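Converting a grid image into this list form takes only a few lines. The sketch below assumes the image is given as rows of intensities and uses 1-based coordinates to match the table above; the `Feature` case class and `toFeatures` are names invented for this example.

```scala
object FeatureVectorSketch {
  // One row of the table: a position and the value measured there.
  final case class Feature(x: Int, y: Int, intensity: Double)

  // Convert a row-major grid into a list of (position, value) rows,
  // using 1-based coordinates as in the table above.
  def toFeatures(grid: Seq[Seq[Double]]): Seq[Feature] =
    for {
      (row, rowIdx) <- grid.zipWithIndex
      (value, colIdx) <- row.zipWithIndex
    } yield Feature(colIdx + 1, rowIdx + 1, value)
}
```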

Page 37: Interactive Scientific Image Analysis using Spark

Why Feature Vectors

If we use feature vectors to describe our image, we no longer have to worry about how the images will be displayed, and can focus on the segmentation / thresholding problem from a classification rather than an image-processing standpoint.

Example

So we have an image of a cell and we want to identify the membrane (the ring) from the nucleus (the point in the middle).

A simple threshold doesn't work because we identify the point in the middle as well. We could try to use morphological tricks to get rid of the point in the middle, or we could better tune our segmentation to the ring structure.

Page 38: Interactive Scientific Image Analysis using Spark

Adding a new feature

In this case we add a very simple feature to the image, the distance from the center of the image (distance).

x      y      Intensity    Distance
-10    -10    0.9350683    14.14214
-10    -9     0.7957197    13.45362
-10    -8     0.6045178    12.80625
-10    -7     0.3876575    12.20656
-10    -6     0.1692429    11.66190

We now have a more complicated image, which we can't as easily visualize, but we can incorporate these two pieces of information together.

Page 39: Interactive Scientific Image Analysis using Spark

Applying two criteria

Now instead of trying to find the intensity for the ring, we can combine density and distance to identify it

$\text{iff}\ (5 < \text{Distance} < 10\ \&\ 0.5 < \text{Intensity} < 1.0)$
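As a sketch of how the two criteria combine, the code below adds the distance feature and then tests both conditions. It assumes the image center is at the origin, as in the table on the previous slide, and it reads the intensity bound as 0.5 < Intensity < 1.0, which is an interpretation of the slide rather than something it states unambiguously. All names are hypothetical.

```scala
object RingSketch {
  final case class Feature(x: Int, y: Int, intensity: Double, distance: Double)

  // Add the distance from the image centre (taken to be the origin) as a feature.
  def withDistance(x: Int, y: Int, intensity: Double): Feature =
    Feature(x, y, intensity, math.hypot(x.toDouble, y.toDouble))

  // The two criteria, with the bounds read as
  // 5 < distance < 10 and 0.5 < intensity < 1.0.
  def isRing(f: Feature): Boolean =
    f.distance > 5 && f.distance < 10 && f.intensity > 0.5 && f.intensity < 1.0
}
```

A bright pixel in the middle of the image (the nucleus) fails the distance test, so it is no longer confused with the ring even though it passes the intensity test.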

Page 40: Interactive Scientific Image Analysis using Spark

Common Features

The distance, while illustrative, is not a commonly used feature; more common are various filters applied to the image

Gaussian Filter (information on the values of the surrounding pixels)

Sobel / Canny Edge Detection (information on edges in the vicinity)

Entropy (information on variability in vicinity)

x    y     Intensity    Sobel    Gaussian
1    1     0.94         0.32     0.53
1    10    0.48         0.50     0.45
1    11    0.50         0.50     0.46
1    12    0.48         0.64     0.46
1    13    0.43         0.78     0.45
1    14    0.33         0.94     0.42

Page 41: Interactive Scientific Image Analysis using Spark

Analyzing the feature vector

The distributions of the features appear very different and can thus likely be used for identifying different parts of the images.

Combine this with our a priori information (called supervised analysis)

Page 42: Interactive Scientific Image Analysis using Spark

Using Machine Learning

Now that the images are stored as feature vectors, they can be easily analyzed with standard Machine Learning tools. It is also much easier to combine with training information.

x      y    Absorb       Scatter      Training
700    4    0.3706262    0.9683849    0.0100140
704    4    0.3694059    0.9648784    0.0100140
692    8    0.3706371    0.9047878    0.0183156
696    8    0.3712537    0.9341989    0.0334994
700    8    0.3666887    0.9826912    0.0453049
704    8    0.3686623    0.8728824    0.0453049

Want to predict Training from x, y, Absorb, and Scatter

MLLib: Logistic Regression, Random Forest, K-Nearest Neighbors, …

Page 43: Interactive Scientific Image Analysis using Spark

Beyond Image Processing

For many datasets, processing, segmentation, and morphological analysis provide all the information that needs to be extracted. For many systems like bone tissue, cellular tissues, cellular materials and many others, the structure is just the beginning and the most interesting results come from the application of physical, chemical, or biological rules inside of these structures.

$\sum_j \vec{F}_{ij} = m\ddot{\vec{x}}_i$

Such systems can be easily represented by a graph, and analyzed using GraphX in a distributed, fault tolerant manner.

Page 44: Interactive Scientific Image Analysis using Spark

Hadoop Filesystem (HDFS not HDF5)

Bottleneck is filesystem connection, many nodes (10+) reading in parallel brings even a GPFS-based infiniband system to a crawl

One of the central tenets of MapReduce™ is data-centric computation: instead of moving data to the computation, move the computation to the data.

Use fast local storage for storing everything redundantly: less transfer and fault-tolerance

Largest file size: 512 yottabytes, Yahoo has a 14 petabyte filesystem in use