Interactive Scientific Image Analysis using Spark

SUMMIT EAST: Interactive Scientific Image Analysis and Analytics using Spark. Kevin Mader, Spark East, NYC, 19 March 2015

Transcript of Interactive Scientific Image Analysis using Spark

  1. 1. Interactive Scientific Image Analysis and Analytics using Spark. Kevin Mader, Spark East, NYC, 19 March 2015
  2. 2. Outline
     - Background: Our Technique (why we have big data)
     - X-Ray Tomographic Microscopy
     - Imaging in 2015
     - The Problem(s)
     - The Tools
     - Spark Imaging Layer
     - 3D Imaging
     - Hyperspectral Imaging
     - Interactive Analysis / Streaming
     - The Science
     - Genome Scale Studies
     - Large Datasets
     - Outlook / Developments
  3. 3. Synchrotron-based X-Ray Tomographic Microscopy
     The only technique which can do all of the following:
     - peer deep into large samples
     - achieve < 1 µm isotropic spatial resolution with 1.8 mm field of view
     - achieve > 10 Hz temporal resolution
     - 8 GB/s of images
     [1] Mokso et al., J. Phys. D, 46(49), 2013
     Courtesy of M. Pistone at U. Bristol
  4. 4. Image Science in 2015: More and faster
     X-Ray
     - Swiss Light Source (SRXTM) images at (>1000 fps) 8 GB/s, diffraction patterns (cSAXS) at 30 GB/s
     - Nanoscopium (Soleil), 10 TB/day, 10-500 GB file sizes, very heterogeneous data
     Optical
     - Light-sheet microscopy (see talk of Jeremy Freeman) produces images at 500 MB/s
     - High-speed confocal images at (>200 fps) 78 Mb/s
     Geospatial
     - New satellite projects (Skybox, etc) will measure hundreds of terabytes to petabytes of images a year
     Personal
     - GoPro 4 Black - 60 MB/s (3840 x 2160 x 30 fps) for $600
     - fps1000 - 400 MB/s (640 x 480 x 840 fps) for $400
  5. 5. How much is a TB, really?
     If you looked at one 1000 x 1000 sized image every second, it would take you 139 hours to browse through a terabyte of data.

     Year   Time to 1 TB   Manpower to keep up   Salary costs / month
     2000   4096 min       2 people              25 kCHF
     2008   1092 min       8 people              95 kCHF
     2014   32 min         260 people            3255 kCHF
     2016   2 min          3906 people           48828 kCHF
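The 139-hour figure can be reproduced with a short calculation. This is only a sketch: the slide does not state the pixel depth, so the 2 bytes per pixel (16-bit grayscale) and the decimal terabyte used here are assumptions chosen because they match the quoted number.

```scala
object TerabyteBrowsing {
  // Assumption: 16-bit grayscale pixels (2 bytes each); not stated on the slide.
  val bytesPerPixel: Long = 2L
  val imageBytes: Long = 1000L * 1000L * bytesPerPixel
  // Assumption: decimal terabyte (10^12 bytes).
  val terabyte: Long = 1000L * 1000L * 1000L * 1000L

  // Number of images in one terabyte, viewed at one image per second.
  val images: Long = terabyte / imageBytes
  val hours: Double = images / 3600.0

  def main(args: Array[String]): Unit =
    println(f"$images images, $hours%.1f hours")
}
```

With these assumptions a terabyte holds 500,000 images, which at one image per second is just under 139 hours.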
  6. 6. Computing has changed: Parallel
     Moore's Law: $\text{Transistors} \propto 2^{T/(18\ \text{months})}$
     Based on data from https://gist.github.com/humberto-ortiz/de4b3a621602b78bf90d
     There are now many more transistors inside a single computer but the processing speed hasn't increased. How can this be?
     - Multiple Core: Many machines have multiple cores for each processor which can perform tasks independently
     - Multiple CPUs: More than one chip is commonly present
     - New modalities: GPUs provide many cores which operate at slow speed
     Parallel Code is important
  7. 7. Cloud Computing Costs
     The figure shows the range of cloud costs (determined by peak usage) compared to a local workstation with utilization shown as the average number of hours the computer is used each week.
     The figure shows the cost of a cloud based solution as a percentage of the cost of buying a single machine. The values below 1 show the percentage as a number. The panels distinguish the average time to replacement for the machines in months.
  8. 8. The Problem
     There is a flood of new data
       What took an entire PhD 3-4 years ago can now be measured in a weekend, or even several seconds. Analysis tools have not kept up, are difficult to customize, and are usually highly specific.
     Optimized Data-Structures do not fit
       Data-structures that were fast and efficient for computers with 640 kb of memory do not make sense anymore.
     Single-core computing is too slow
       CPUs are not getting that much faster but there are a lot more of them. Iterating through a huge array takes almost as long on 2014 hardware as on 2006 hardware.
  9. 9. Exploratory Image Processing Priorities
     Correctness
       The most important job for any piece of analysis is to be correct.
       - A powerful testing framework is essential
       - Avoid repetition of code which leads to inconsistencies
       - Use compilers to find mistakes rather than users
     Easily understood, changed, and used
       Almost all image processing tasks require a number of people to evaluate and implement them and are almost always moving targets.
       - Flexible, modular structure that enables
     Fast
       The last of the major priorities is speed, which covers scalability, raw performance, and development time.
       - Long waits for processing discourage exploration
       - Manual access to data on separate disks is a huge speed barrier
       - Real-time image processing requires millisecond latencies
       - Implementing new ideas can be done quickly
  10. 10. The Framework First
      - Rather than building an analysis as quickly as possible and then trying to hack it to scale up to large datasets, choose the framework first, then start making the necessary tools.
      - Google, Amazon, Yahoo, and many other companies have made huge in-roads into these problems
      - The real need is a fast, flexible framework for robustly, scalably performing complicated analyses, a sort of Excel for big imaging data.
      Apache Spark and Hadoop 2
      The two frameworks provide a free out-of-the-box solution for
      - scaling to >10000 computers
      - storing and processing exabytes of data
      - fault tolerance: 2/3rds of computers can crash and a request still accurately finishes
      - hardware and software platform independence (Mac, Windows, Linux)
  11. 11. Spark -> Microscopy?
      These frameworks are really cool and Spark has a big vocabulary, but flatMap, filter, aggregate, join, groupBy, and fold still do not sound like anything I want to do to an image.
      I want to
      - filter out noise, segment, choose regions of interest
      - contour, component label
      - measure, count, and analyze
      Spark Image Layer
      - Developed at 4Quant, ETH Zurich, and Paul Scherrer Institut
      - The Spark Image Layer is a Domain Specific Language for Microscopy for Spark.
      - It converts common imaging tasks into coarse-grained Spark operations
  12. 12. Spark Image Layer
      We have developed a number of commands for SIL handling standard image processing tasks
      Fully extensible with
  13. 13. Use case: Hyperspectral Imaging
      Hyperspectral imaging is a rapidly growing area with the potential for massive datasets and a severe deficit of usable tools.
      The scale of the data is large and standard image processing tools are ill-suited for handling them, although the ideas used in image processing are equally applicable to hyperspectral data (filtering, thresholding, segmentation, ...) and distributed, parallel approaches make even more sense on such massive datasets.
  14. 14. Flexibility through Types
      Developing in Scala brings additional flexibility through types [1]. With microscopy the standard formats are 2-, 3- and even 4- or more dimensional arrays or matrices which can be iterated through quickly using CPU and GPU code. While that is still possible in Scala, there is a great deal more flexibility for data types, allowing anything to be stored as an image and then processed as long as basic functions make sense.
      [1] Fighting Bit Rot with Types (Experience Report: Scala Collections), M. Odersky, FSTTCS 2009, December 2009
      What is an image?
        A collection of positions and values, maybe more (not an array of double). Arrays are efficient for storing in computer memory, but often a poor way of expressing scientific ideas and analyses.
      Filter Noise?
        combine information from nearby pixels
      Find objects
        determine groups of pixels which are very similar to desired result
  15. 15. Making Coding Simpler with Types
        trait BasicMathSupport[T] extends Serializable {
          def plus(a: T, b: T): T
          def times(a: T, b: T): T
          def scale(a: T, b: Double): T
          def negate(a: T): T = scale(a, -1)
          def invert(a: T): T
          def abs(a: T): T
          def minus(a: T, b: T): T = plus(a, negate(b))
          def divide(a: T, b: T): T = times(a, invert(b))
          def compare(a: T, b: T): Int
        }
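To illustrate how such a type class is used, here is a minimal sketch that repeats the trait and adds an instance for plain `Double` pixels. The names `MathSupportSketch`, `doubleSupport` and `mean` are hypothetical and not part of the Spark Image Layer; only the trait itself comes from the slide.

```scala
object MathSupportSketch {
  // The trait as shown on the slide.
  trait BasicMathSupport[T] extends Serializable {
    def plus(a: T, b: T): T
    def times(a: T, b: T): T
    def scale(a: T, b: Double): T
    def negate(a: T): T = scale(a, -1)
    def invert(a: T): T
    def abs(a: T): T
    def minus(a: T, b: T): T = plus(a, negate(b))
    def divide(a: T, b: T): T = times(a, invert(b))
    def compare(a: T, b: T): Int
  }

  // Hypothetical instance for plain Double pixels: only the abstract
  // members are supplied; negate, minus and divide are derived by the trait.
  implicit val doubleSupport: BasicMathSupport[Double] = new BasicMathSupport[Double] {
    def plus(a: Double, b: Double): Double = a + b
    def times(a: Double, b: Double): Double = a * b
    def scale(a: Double, b: Double): Double = a * b
    def invert(a: Double): Double = 1.0 / a
    def abs(a: Double): Double = math.abs(a)
    def compare(a: Double, b: Double): Int = java.lang.Double.compare(a, b)
  }

  // Generic helper written once against the trait: mean of a non-empty sequence.
  def mean[T](values: Seq[T])(implicit ms: BasicMathSupport[T]): T =
    ms.scale(values.reduce(ms.plus), 1.0 / values.size)
}
```

Any pixel type that supplies the six abstract operations gets `minus`, `divide` and helpers such as `mean` without further code.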
  16. 16. Continuing with Types
      Simple filter implementation
        def SimpleFilter[T](inImage: Image[T])(implicit wst: BasicMathSupport[T]) = {
          val width: Double = 1
          // Gaussian weight based on the distance of the neighbor from the center
          val kernel = (pos: D3int, value: T) =>
            wst.scale(value, math.exp(-math.pow(pos.mag / width, 2)))
          // combine two weighted contributions
          val kernelReduce = (ptA: T, ptB: T) => wst.scale(wst.plus(ptA, ptB), 0.5)
          runFilter(inImage, kernel, kernelReduce)
        }
      Spectra as well-supported types
        implicit val SpectraBMS = new BasicMathSupport[Array[Double]] {
          def plus(a: Array[Double], b: Array[Double]) = a.zip(b).map { case (x, y) => x + y }
          ...
          def scale(a: Array[Double], b: Double) = a.map(_ * b)
        }
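The point of the slide is that one filter definition serves both scalar pixels and whole spectra. The following is a self-contained sketch of that idea, not the Spark Image Layer implementation: it uses a reduced copy of the trait, and `gaussianAverage` is a hypothetical helper that computes a normalised Gaussian-weighted mean (a swapped-in combination rule, simpler to verify than the pairwise reduce on the slide).

```scala
object SpectraFilterSketch {
  // Reduced copy of the slide's trait: only the operations this sketch needs.
  trait BasicMathSupport[T] extends Serializable {
    def plus(a: T, b: T): T
    def scale(a: T, b: Double): T
  }

  implicit val doubleBMS: BasicMathSupport[Double] = new BasicMathSupport[Double] {
    def plus(a: Double, b: Double): Double = a + b
    def scale(a: Double, b: Double): Double = a * b
  }

  // A spectrum is an Array[Double]; arithmetic is applied channel by channel.
  implicit val spectraBMS: BasicMathSupport[Array[Double]] =
    new BasicMathSupport[Array[Double]] {
      def plus(a: Array[Double], b: Array[Double]): Array[Double] =
        a.zip(b).map { case (x, y) => x + y }
      def scale(a: Array[Double], b: Double): Array[Double] = a.map(_ * b)
    }

  // Gaussian-weighted average of (distance, value) pairs around a centre pixel.
  // `width` plays the role of the kernel width on the slide.
  def gaussianAverage[T](neighbours: Seq[(Double, T)], width: Double)
                        (implicit ms: BasicMathSupport[T]): T = {
    val weighted = neighbours.map { case (dist, value) =>
      val w = math.exp(-math.pow(dist / width, 2))
      (w, ms.scale(value, w))
    }
    val totalWeight = weighted.map(_._1).sum
    ms.scale(weighted.map(_._2).reduce(ms.plus), 1.0 / totalWeight)
  }
}
```

The same `gaussianAverage` call works on `Double` values and on `Array[Double]` spectra; only the implicit instance changes.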
  17. 17. Interactive Analysis
      Combining many different components together inside of the Spark Shell, IPython, or Zeppelin makes it easier to assemble workflows.
  18. 18. Scientific Cases: Genome-scale Imaging
      - We want to understand the relationship between genetic background and bone structure
      - With existing tools, analysis is possible and a number of publications have been made, even ones that show differences between strains of mice
      But
      - ~30 s/object
      - 30-40k objects per sample
      - One sample in 6.25 weeks
      2014 approach - 1.5 years
      - ImageJ macro for segmentation (2-4 hours/sample)
      - Python script for shape analysis (3 hours/sample)
      - Paraview macro for network and connectivity (2 hours/sample)
      - Python script to pool results (3-4 hours)
      - MySQL Database storing results (5 minutes/query)
  19. 20. Genetic Studies using Spark Image Layer
      - Analysis could be completed in several months (instead of 120 years, could now be completed in days in the cloud)
      - Data can be freely explored and analyzed
        val bones = sc.loadImages("work/f2_bones/*/bone.tif")
      - Segment hard and soft tissues
      - Label cells
      - Export results
        val hardTissue = bones.threshold(OTSU)
        val softTissue = hardTissue.invert
        val cells = hardTissue.componentLabel.
          filter(c=>c.size>100 & c.size1000).map(sampleToPath).joinByKey(bones)
      - See immediately in datasets of terabytes which image had the largest cells
      - New hypotheses and analyses can be done in seconds / minutes

      Task                    Single Core Time   Spark Time (40 cores)
      Load and Preprocess     360 minutes        10 minutes
      Single Column Average   4.6 s              400 ms
      1 K-means Iteration     2 minutes          1 s
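The threshold, label, and size-filter steps can be sketched on a tiny in-memory image using plain Scala collections. This is an illustration only: the names below are hypothetical, a fixed cutoff stands in for the Otsu threshold, and nothing here uses the Spark Image Layer API.

```scala
object CellLabelSketch {
  type Pos = (Int, Int)

  // Keep the positions whose value exceeds a fixed cutoff (assumption:
  // a hand-picked value stands in for the Otsu threshold on the slide).
  def threshold(img: Map[Pos, Double], cutoff: Double): Set[Pos] =
    img.filter { case (_, v) => v > cutoff }.keySet

  // 4-connected component labelling by flood fill; returns one set per object.
  def componentLabel(foreground: Set[Pos]): List[Set[Pos]] = {
    var remaining = foreground
    var components = List.empty[Set[Pos]]
    while (remaining.nonEmpty) {
      var frontier = List(remaining.head)
      var component = Set.empty[Pos]
      while (frontier.nonEmpty) {
        val (x, y) = frontier.head
        frontier = frontier.tail
        if (remaining.contains((x, y)) && !component.contains((x, y))) {
          component += ((x, y))
          frontier = List((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)) ::: frontier
        }
      }
      remaining = remaining -- component
      components = component :: components
    }
    components
  }

  // Keep only objects larger than minSize pixels.
  def largerThan(components: List[Set[Pos]], minSize: Int): List[Set[Pos]] =
    components.filter(_.size > minSize)
}
```

On a distributed image the same three steps would be expressed as operations over partitions, but the logic per object is unchanged.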
  20. 22. Science Problems: Full Brain Imaging
      Collaboration with A. Astolfo and A. Patera
      - Measure a full mouse brain (1 cm³) with cellular resolution (1 µm)
      - 10 x 10 x 10 scans at 2560 x 2560 x 2160 -> 14 TVoxels
      - 0.000004% of the entire dataset
      - 14 TVoxels = 56 TB
      - Each scan needs to be registered and aligned together
      - There are no computers with 56 TB of memory
      - Even multithreaded approaches are not feasible and require many logistics
      - Analysis of the stitched data is also of interest (segmentation, vessel analysis, distribution and network connectivity)
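The data volume follows directly from the scan geometry. In this sketch the 4 bytes per voxel is an assumption (for example 32-bit floats); it is not stated on the slide but is what the 14 TVoxels = 56 TB figure implies.

```scala
object BrainVolume {
  val scans: Long = 10L * 10L * 10L
  val voxelsPerScan: Long = 2560L * 2560L * 2160L
  val totalVoxels: Long = scans * voxelsPerScan

  // Assumption: 4 bytes per voxel, implied by 14 TVoxels = 56 TB.
  val bytesPerVoxel: Long = 4L
  val totalBytes: Long = totalVoxels * bytesPerVoxel

  val teraVoxels: Double = totalVoxels / 1e12
  val teraBytes: Double = totalBytes / 1e12

  def main(args: Array[String]): Unit =
    println(f"$teraVoxels%.1f TVoxels, $teraBytes%.1f TB")
}
```

The product is about 14.2 × 10^12 voxels and about 56.6 TB, consistent with the rounded values on the slide.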
  21. 23. Science Problems: Big Stitching
      Images: RDD[((x, y, z), Img[Double])] = $[(\vec{x}, \mathrm{Img}), \cdots]$
        dispField = Images.
          cartesian(Images).map {
            case ((xA, ImA), (xB, ImB)) =>
              xcorr(ImA, ImB, in = xB - xA)
          }
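The pairwise matching step can be sketched without Spark. The example below is a simplification under stated assumptions: tiles are 1-D signals, the displacement is found by brute-force search over integer shifts (the slide's `xcorr` instead takes the nominal offset as a starting point), and all names are hypothetical.

```scala
object StitchSketch {
  // Mean product of the overlapping samples when a(i) is compared to b(i - shift).
  def overlapScore(a: Array[Double], b: Array[Double], shift: Int): Double = {
    val pairs = for {
      i <- a.indices
      j = i - shift
      if j >= 0 && j < b.length
    } yield a(i) * b(j)
    if (pairs.isEmpty) Double.NegativeInfinity else pairs.sum / pairs.size
  }

  // Brute-force search for the shift with the highest score within +/- maxShift.
  def bestShift(a: Array[Double], b: Array[Double], maxShift: Int): Int =
    (-maxShift to maxShift).maxBy(s => overlapScore(a, b, s))

  // All ordered pairs of distinct tiles, like `cartesian` followed by a filter.
  def displacementField(tiles: Map[Int, Array[Double]], maxShift: Int): Map[(Int, Int), Int] =
    (for {
      (posA, imgA) <- tiles.toSeq
      (posB, imgB) <- tiles.toSeq
      if posA != posB
    } yield (posA, posB) -> bestShift(imgA, imgB, maxShift)).toMap
}
```

Because every pair is scored independently, the computation maps naturally onto a cartesian product followed by a map, which is the structure shown on the slide.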
  22. 24. From Matching to Stitching
      From the updated information provided by the cross correlations and by applying appropriate smoothing criteria (if necessary).
      The stitching itself, rather than rewriting the original data, can be done in a lazy fashion as certain regions of the image are read.
      This also ensures the original data is left unaltered and all analysis is reversible.
        def getView(tPos,tSize) =
          stImgs.
            filter(x=>abs(x-tPos) val oImg = new Image(tSize)
              oImg.copy(img,x,tPos)
            }.addImages(AVG)
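The lazy-view idea (select the tiles touching a window, copy their contents into place, and average the overlaps) can be sketched in one dimension. This is a hypothetical stand-in, not the `getView` from the slide: `Tile` and the NaN convention for uncovered positions are assumptions made for the example.

```scala
object LazyViewSketch {
  // A tile is its start position plus its samples; tiles may overlap.
  final case class Tile(start: Int, data: Array[Double])

  // Assemble only the requested window, averaging wherever tiles overlap.
  // Positions not covered by any tile come back as NaN.
  def getView(tiles: Seq[Tile], pos: Int, size: Int): Array[Double] = {
    // Only tiles that intersect [pos, pos + size) are touched at all.
    val touching = tiles.filter(t => t.start < pos + size && t.start + t.data.length > pos)
    Array.tabulate(size) { k =>
      val x = pos + k
      val values = touching.collect {
        case t if x >= t.start && x < t.start + t.data.length => t.data(x - t.start)
      }
      if (values.isEmpty) Double.NaN else values.sum / values.size
    }
  }
}
```

The source tiles are never modified, so a different registration or blending rule can be tried by calling the function again.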
  23. 25. Viewing Regions
        getView(Pos(26.5,13),Size(2,2))
  24. 26. Real-time with Spark Streaming: Webcam
      In the biological imaging community, the open source tools of ImageJ2 and Fiji are widely accepted and have a large number of readily available plugins and tools.
      We can integrate the functionality directly into Spark and perform operations on much larger datasets than a single machine could have in memory. Additionally these analyses can be performed on streaming data.
  25. 27. Streaming Analysis: Real-time Webcam Processing
        val wr = new WebcamReceiver()
        val ssc = sc.toStreaming(strTime)
        val imgList = ssc.receiverStream(wr)
      Filter images
        val filtImgs = allImgs.mapValues(_.run("Median...","radius=3"))
      Create a background image
        val totImgs = inImages.count()
        val bgImage = inImages.reduce(_ add _).multiply(1.0/totImgs)
  26. 28. Identify Outliers in Streams
      Remove the background image and find the mean value
        val eventImages = filtImgs.
          transform { inImages =>
            val corImage = inImages.map {
              case (inTime, inImage) =>
                val corImage = inImage.subtract(bgImage)
                (corImage.getImageStatistics().mean, (inTime, corImage))
            }
            corImage
          }
      Show the outliers
        eventImages.filter(iv => Math.abs(iv._1) > 20).
          foreachRDD(showResultsStr("outlier", _))
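The background subtraction and outlier test can be sketched on in-memory frames. This is a plain-Scala illustration with hypothetical names, not Spark Streaming or ImageJ code; frames are flattened to `Array[Double]` and the cutoff of 20 is taken from the slide.

```scala
object OutlierSketch {
  type Frame = Array[Double]

  // Pixel-wise mean of all frames (the "background image").
  def background(frames: Seq[Frame]): Frame =
    frames.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / frames.size)

  // Mean of the background-corrected frame.
  def meanDeviation(frame: Frame, bg: Frame): Double =
    frame.zip(bg).map { case (v, b) => v - b }.sum / frame.length

  // Keep the (time, frame) pairs whose corrected mean exceeds the cutoff in magnitude.
  def outliers(frames: Seq[(Long, Frame)], bg: Frame, cutoff: Double): Seq[(Long, Frame)] =
    frames.filter { case (_, f) => math.abs(meanDeviation(f, bg)) > cutoff }
}
```

One design point worth noting: if the background is averaged over frames that include the events themselves, a strong event shifts the background and can push ordinary frames over the cutoff, so the background is best estimated from a quiet period.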
  27. 29. Streaming Demo with Webcam
  28. 30. As a scientist (not a data-scientist)
      Apache Spark is a brilliant platform, and utilizing GraphX, MLLib, and other packages there are unlimited possibilities.
      - Scala can be a beautiful but not easy language
      - Python is an easier language
      - Both suffer from
        - Non-obvious workflows
        - Scripts depending on scripts depending on scripts (can be very fragile)
      - Although all analyses can be expressed as a workflow, this is often difficult to see from the code
      - Non-technical persons have little ability to understand or make minor adjustments to analysis
      - Parameters require recompiling to change, or GUIs need to be placed on top
  29. 31. A basic image filtering operation
      - Thanks to Spark, it is cached, in memory, approximate, cloud-ready
      - Thanks to Map-Reduce it is fault-tolerant, parallel, distributed
      - Thanks to Java, it is hardware agnostic
      - But it is also not really so readable
        def spread_voxels(pvec: ((Int,Int),Double), windSize: Int = 1) = {
          val wind=(-windSize to windSize)
          val pos=pvec._1
          val scalevalue=pvec._2/(wind.length*wind.length)
          for(x 1.0)
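The visible part of the code suggests a box filter written as "spread each voxel to its neighbours, then sum per position". The following is a sketch of that pattern on local collections, not a recovery of the truncated original: the body of the loop and the `boxFilter` helper are assumptions, and `groupBy` plus a sum stands in for Spark's `reduceByKey`.

```scala
object SpreadVoxelsSketch {
  type Pos = (Int, Int)

  // Emit one scaled copy of the voxel for every position in its window.
  def spreadVoxels(pvec: (Pos, Double), windSize: Int): Seq[(Pos, Double)] = {
    val wind = -windSize to windSize
    val ((px, py), value) = pvec
    val scaleValue = value / (wind.length * wind.length)
    for (x <- wind; y <- wind) yield ((px + x, py + y), scaleValue)
  }

  // flatMap to spread, then sum the contributions landing on each position
  // (the local stand-in for reduceByKey(_ + _) on a pair RDD).
  def boxFilter(img: Seq[(Pos, Double)], windSize: Int): Map[Pos, Double] =
    img.flatMap(spreadVoxels(_, windSize))
      .groupBy(_._1)
      .map { case (pos, contributions) => pos -> contributions.map(_._2).sum }
}
```

Expressing the filter this way means no voxel needs random access to its neighbours, which is what makes it fit a flatMap-and-reduce execution model, at the cost of readability.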
  30. 40. Common Features
      The distance, while illustrative, is not a commonly used feature; more common are various filters applied to the image:
      - Gaussian Filter (information on the values of the surrounding pixels)
      - Sobel / Canny Edge Detection (information on edges in the vicinity)
      - Entropy (information on variability in vicinity)

      x   y    Intensity   Sobel   Gaussian
      1   1    0.94        0.32    0.53
      1   10   0.48        0.50    0.45
      1   11   0.50        0.50    0.46
      1   12   0.48        0.64    0.46
      1   13   0.43        0.78    0.45
      1   14   0.33        0.94    0.42
  31. 41. Analyzing the feature vector
      The distributions of the features appear very different and can thus likely be used for identifying different parts of the images.
      Combine this with our a priori information (called supervised analysis)
  32. 42. Using Machine Learning
      Now that the images are stored as feature vectors, they can be easily analyzed with standard Machine Learning tools. It is also much easier to combine with training information.

      x     y   Absorb      Scatter     Training
      700   4   0.3706262   0.9683849   0.0100140
      704   4   0.3694059   0.9648784   0.0100140
      692   8   0.3706371   0.9047878   0.0183156
      696   8   0.3712537   0.9341989   0.0334994
      700   8   0.3666887   0.9826912   0.0453049
      704   8   0.3686623   0.8728824   0.0453049

      Want to predict Training from x, y, Absorb, and Scatter
      MLLib: Logistic Regression, Random Forest, K-Nearest Neighbors, ...
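To make "predict Training from x, y, Absorb, and Scatter" concrete, here is a minimal nearest-neighbour regression in plain Scala. It is a stand-in chosen for brevity, not MLlib code, and the names are hypothetical. The features are used unscaled, which is a simplifying assumption: in practice x and y would dominate the distance and the columns would normally be standardised first.

```scala
object KnnSketch {
  final case class Sample(features: Array[Double], target: Double)

  // Euclidean distance between two feature vectors of equal length.
  def distance(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Predict the target as the mean over the k nearest training samples.
  def predict(training: Seq[Sample], query: Array[Double], k: Int): Double = {
    val nearest = training.sortBy(s => distance(s.features, query)).take(k)
    nearest.map(_.target).sum / nearest.size
  }
}
```

Each row of the table becomes one `Sample` with features (x, y, Absorb, Scatter) and the Training column as the target.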
  33. 43. Beyond Image Processing
      For many datasets, processing, segmentation, and morphological analysis is all the information needed to be extracted. For many systems like bone tissue, cellular tissues, cellular materials and many others, the structure is just the beginning and the most interesting results come from the application of physical, chemical, or biological rules inside of these structures.
      $\sum_j \vec{F}_{ij} = m \ddot{\vec{x}}_i$
      Such systems can be easily represented by a graph, and analyzed using GraphX in a distributed, fault tolerant manner.
  34. 44. Hadoop Filesystem (HDFS not HDF5)
      - Bottleneck is filesystem connection: many nodes (10+) reading in parallel brings even a GPFS-based Infiniband system to a crawl
      - One of the central tenets of MapReduce is data-centric computation: instead of data to computation, move the computation to the data.
      - Use fast local storage for storing everything redundantly: less transfer and fault-tolerance
      - Largest file size: 512 yottabytes; Yahoo has a 14 petabyte filesystem in use