Interactive Scientific Image Analysis using Spark
Author
kevin-mader -
Transcript of Interactive Scientific Image Analysis using Spark
- 1. SUMMIT EAST: Interactive Scientific Image Analysis and Analytics using Spark
  Kevin Mader, Spark East, NYC, 19 March 2015
- 2. Outline
  - Background: Our Technique (why we have big data)
  - X-Ray Tomographic Microscopy
  - Imaging in 2015
  - The Problem(s)
  - The Tools
  - Spark Imaging Layer
  - 3D Imaging
  - Hyperspectral Imaging
  - Interactive Analysis / Streaming
  - The Science
  - Genome Scale Studies
  - Large Datasets
  - Outlook / Developments
- 3. Synchrotron-based X-Ray Tomographic Microscopy
  The only technique which can do all of the following:
  - peer deep into large samples
  - achieve isotropic spatial resolution (< 1 μm) with a 1.8 mm field of view
  - achieve >10 Hz temporal resolution
  - 8 GB/s of images
  [1] Mokso et al., J. Phys. D, 46(49), 2013
  Image courtesy of M. Pistone at U. Bristol
- 4. Image Science in 2015: More and faster
  X-Ray
  - Swiss Light Source (SRXTM) images at (>1000 fps) 8 GB/s, diffraction patterns (cSAXS) at 30 GB/s
  - Nanoscopium (Soleil), 10 TB/day, 10-500 GB file sizes, very heterogeneous data
  Optical
  - Light-sheet microscopy (see talk of Jeremy Freeman) produces images at 500 MB/s
  - High-speed confocal images at (>200 fps) 78 Mb/s
  Geospatial
  - New satellite projects (Skybox, etc.) will measure hundreds of terabytes to petabytes of images a year
  Personal
  - GoPro 4 Black - 60 MB/s (3840 x 2160 x 30 fps) for $600
  - fps1000 - 400 MB/s (640 x 480 x 840 fps) for $400
- 5. How much is a TB, really?
  If you looked at one 1000 x 1000 sized image every second, it would take you 139 hours to browse through a terabyte of data.
  Year | Time to 1 TB | Manpower to keep up | Salary Costs / Month
  2000 | 4096 min | 2 people | 25 kCHF
  2008 | 1092 min | 8 people | 95 kCHF
  2014 | 32 min | 260 people | 3255 kCHF
  2016 | 2 min | 3906 people | 48828 kCHF
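The 139-hour figure can be reproduced with a few lines of arithmetic. The sketch below assumes 16-bit pixels and 1 TB = 10^12 bytes; neither is stated on the slide, but together they match both the 139 hours and the manpower column of the table. The names `TerabyteBrowsing` and `peopleToKeepUp` are illustrative only.

```scala
object TerabyteBrowsing {
  // Assumptions (not stated on the slide): 16-bit pixels, 1 TB = 10^12 bytes.
  val bytesPerPixel: Long = 2L
  val imageBytes: Long = 1000L * 1000L * bytesPerPixel
  val terabyte: Long = 1000000000000L

  // One image per second, so the number of images equals the number of seconds.
  val browseSeconds: Double = terabyte.toDouble / imageBytes
  val browseHours: Double = browseSeconds / 3600.0

  // People needed to keep up when a new terabyte arrives every `minutesPerTB` minutes.
  def peopleToKeepUp(minutesPerTB: Double): Double =
    (browseSeconds / 60.0) / minutesPerTB

  def main(args: Array[String]): Unit = {
    println(f"Hours to browse 1 TB: $browseHours%.1f")
    println(f"People needed at 32 min per TB: ${peopleToKeepUp(32)}%.1f")
  }
}
```

Under these assumptions the 2014 row (32 minutes per terabyte) works out to roughly 260 people, which agrees with the table.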
- 6. Computing has changed: Parallel
  Moore's Law: $\text{Transistors} \propto 2^{T/(18\ \text{months})}$
  Based on data from https://gist.github.com/humberto-ortiz/de4b3a621602b78bf90d
  There are now many more transistors inside a single computer but the processing speed hasn't increased. How can this be?
  - Multiple Core: Many machines have multiple cores for each processor which can perform tasks independently
  - Multiple CPUs: More than one chip is commonly present
  - New modalities: GPUs provide many cores which operate at slow speed
  Parallel Code is important
- 7. Cloud Computing Costs
  The figure shows the range of cloud costs (determined by peak usage) compared to a local workstation with utilization shown as the average number of hours the computer is used each week.
  The figure shows the cost of a cloud based solution as a percentage of the cost of buying a single machine. The values below 1 show the percentage as a number. The panels distinguish the average time to replacement for the machines in months.
- 8. The Problem
  - There is a flood of new data: What took an entire PhD 3-4 years ago can now be measured in a weekend, or even several seconds. Analysis tools have not kept up, are difficult to customize, and are usually highly specific.
  - Optimized data structures do not fit: Data structures that were fast and efficient for computers with 640 kB of memory do not make sense anymore.
  - Single-core computing is too slow: CPUs are not getting that much faster but there are a lot more of them. Iterating through a huge array takes almost as long on 2014 hardware as on 2006 hardware.
- 9. Exploratory Image Processing Priorities
  Correctness: The most important job for any piece of analysis is to be correct.
  - A powerful testing framework is essential
  - Avoid repetition of code which leads to inconsistencies
  - Use compilers to find mistakes rather than users
  Easily understood, changed, and used: Almost all image processing tasks require a number of people to evaluate and implement them and are almost always moving targets.
  - Flexible, modular structure that enables
  Fast: The last of the major priorities is speed, which covers scalability, raw performance, and development time.
  - Long waits for processing discourage exploration
  - Manual access to data on separate disks is a huge speed barrier
  - Real-time image processing requires millisecond latencies
  - Implementing new ideas can be done quickly
- 10. The Framework First
  Rather than building an analysis as quickly as possible and then trying to hack it to scale up to large datasets:
  - choose the framework first
  - then start making the necessary tools.
  Google, Amazon, Yahoo, and many other companies have made huge inroads into these problems.
  The real need is a fast, flexible framework for robustly, scalably performing complicated analyses, a sort of Excel for big imaging data.
  Apache Spark and Hadoop 2: the two frameworks provide a free out-of-the-box solution for
  - scaling to >10000 computers
  - storing and processing exabytes of data
  - fault tolerance: 2/3rds of computers can crash and a request still accurately finishes
  - hardware and software platform independence (Mac, Windows, Linux)
- 11. Spark -> Microscopy?
  These frameworks are really cool and Spark has a big vocabulary, but flatMap, filter, aggregate, join, groupBy, and fold still do not sound like anything I want to do to an image.
  I want to
  - filter out noise, segment, choose regions of interest
  - contour, component label
  - measure, count, and analyze
  Spark Image Layer
  - Developed at 4Quant, ETH Zurich, and Paul Scherrer Institut
  - The Spark Image Layer is a Domain Specific Language for Microscopy for Spark.
  - It converts common imaging tasks into coarse-grained Spark operations
- 12. Spark Image Layer
  We have developed a number of commands for SIL handling standard image processing tasks
  Fully extensible with
- 13. Use case: Hyperspectral Imaging
  Hyperspectral imaging is a rapidly growing area with the potential for massive datasets and a severe deficit of usable tools.
  The scale of the data is large and standard image processing tools are ill-suited for handling them, although the ideas used in image processing are equally applicable to hyperspectral data (filtering, thresholding, segmentation, ...) and distributed, parallel approaches make even more sense on such massive datasets.
- 14. Flexibility through Types
  Developing in Scala brings additional flexibility through types [1]. In microscopy the standard formats are 2-, 3- and even 4- or more dimensional arrays or matrices which can be iterated through quickly using CPU and GPU code. While this is still possible in Scala, there is a great deal more flexibility for data types, allowing anything to be stored as an image and then processed as long as basic functions make sense.
  [1] Fighting Bit Rot with Types (Experience Report: Scala Collections), M. Odersky, FSTTCS 2009, December 2009
  What is an image? A collection of positions and values, maybe more (not an array of double). Arrays are efficient for storing in computer memory, but often a poor way of expressing scientific ideas and analyses.
  - Filter noise? Combine information from nearby pixels
  - Find objects? Determine groups of pixels which are very similar to the desired result
- 15. Making Coding Simpler with Types
    trait BasicMathSupport[T] extends Serializable {
      def plus(a: T, b: T): T
      def times(a: T, b: T): T
      def scale(a: T, b: Double): T
      def negate(a: T): T = scale(a,-1)
      def invert(a: T): T
      def abs(a: T): T
      def minus(a: T, b: T): T = plus(a, negate(b))
      def divide(a: T, b: T): T = times(a, invert(b))
      def compare(a: T, b: T): Int
    }
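To illustrate how a trait like this might be used, the sketch below repeats the trait from the slide, supplies a hypothetical instance for plain `Double` pixels, and writes one generic function against it. The names `TypeClassDemo`, `doubleSupport`, and `mean` are illustrative and do not come from the talk.

```scala
object TypeClassDemo {
  // Trait repeated from the slide so the example is self-contained.
  trait BasicMathSupport[T] extends Serializable {
    def plus(a: T, b: T): T
    def times(a: T, b: T): T
    def scale(a: T, b: Double): T
    def negate(a: T): T = scale(a, -1)
    def invert(a: T): T
    def abs(a: T): T
    def minus(a: T, b: T): T = plus(a, negate(b))
    def divide(a: T, b: T): T = times(a, invert(b))
    def compare(a: T, b: T): Int
  }

  // Hypothetical instance for scalar pixels (an assumption, not from the slides).
  implicit val doubleSupport: BasicMathSupport[Double] =
    new BasicMathSupport[Double] {
      def plus(a: Double, b: Double): Double = a + b
      def times(a: Double, b: Double): Double = a * b
      def scale(a: Double, b: Double): Double = a * b
      def invert(a: Double): Double = 1.0 / a
      def abs(a: Double): Double = math.abs(a)
      def compare(a: Double, b: Double): Int = java.lang.Double.compare(a, b)
    }

  // Generic mean: works for any pixel type T that has a BasicMathSupport instance.
  // Throws on an empty sequence, since reduce has nothing to combine.
  def mean[T](values: Seq[T])(implicit bms: BasicMathSupport[T]): T =
    bms.scale(values.reduce((a, b) => bms.plus(a, b)), 1.0 / values.size)
}
```

After `import TypeClassDemo._`, a call such as `mean(Seq(1.0, 3.0))` resolves the instance implicitly. The same `mean` would work unchanged for any other pixel type once an instance is provided.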
- 16. Continuing with Types
  Simple filter implementation
    def SimpleFilter[T](inImage: Image[T])
        (implicit val wst: BasicMathSupport[T]) = {
      val width: Double = 1
      kernel = (pos: D3int,value: T) =>
        value * exp(- (pos.mag/width)**2)
      kernelReduce = (ptA,ptB) => (ptA + ptB) * 0.5
      runFilter(inImage,kernel,kernelReduce)
    }
  Spectra as well supported types
    implicit val SpectraBMS = new BasicMathSupport[Array[Double]] {
      def plus(a: Array[Double], b: Array[Double]) = a.zip(b).map(_ + _)
      ...
      def scale(a: Array[Double], b: Double) = a.map(_*b)
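The slide code depends on types that are not shown (`Image`, `D3int`, `runFilter`), so it cannot be run as is. The following is a minimal sketch of the same idea under stated assumptions: a 1D image stored as a map from position to value, a reduced stand-in trait `MathOps` with only `plus` and `scale`, and a normalized Gaussian-weighted mean in place of the slide's pairwise `kernelReduce`. All names here are hypothetical.

```scala
object SpectralFilterDemo {
  // Reduced stand-in for the slide's BasicMathSupport: only what this sketch needs.
  trait MathOps[T] {
    def plus(a: T, b: T): T
    def scale(a: T, b: Double): T
  }

  implicit val doubleOps: MathOps[Double] = new MathOps[Double] {
    def plus(a: Double, b: Double): Double = a + b
    def scale(a: Double, b: Double): Double = a * b
  }

  // A spectrum is an array of intensities; arithmetic is element-wise.
  implicit val spectrumOps: MathOps[Array[Double]] = new MathOps[Array[Double]] {
    def plus(a: Array[Double], b: Array[Double]): Array[Double] =
      a.zip(b).map { case (x, y) => x + y }
    def scale(a: Array[Double], b: Double): Array[Double] = a.map(_ * b)
  }

  // Gaussian-weighted mean of all pixels around `center`.
  // The image must be non-empty.
  def gaussianAt[T](image: Map[Int, T], center: Int, width: Double = 1.0)
                   (implicit ops: MathOps[T]): T = {
    val weighted = image.toSeq.map { case (pos, value) =>
      val w = math.exp(-math.pow((pos - center) / width, 2))
      (ops.scale(value, w), w)
    }
    val (total, totalWeight) = weighted.reduce { (a, b) =>
      (ops.plus(a._1, b._1), a._2 + b._2)
    }
    ops.scale(total, 1.0 / totalWeight)
  }
}
```

The same `gaussianAt` accepts a `Map[Int, Double]` or a `Map[Int, Array[Double]]`, which is the point the slide makes about spectra being supported like any other pixel type.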
- 17. Interactive Analysis
  Combining many different components together inside of the Spark Shell, IPython, or Zeppelin makes it easier to assemble workflows
- 18. Scientific Cases: Genome-scale Imaging
  We want to understand the relationship between genetic background and bone structure.
  With existing tools, analysis is possible and a number of publications have been made, even ones that show differences between strains of mice.
  But
  - 30 s/object
  - 30-40k objects per sample
  - One sample in 6.25 weeks
  2014 approach - 1.5 years
  - ImageJ macro for segmentation (2-4 hours/sample)
  - Python script for shape analysis (3 hours/sample)
  - Paraview macro for network and connectivity (2 hours/sample)
  - Python script to pool results (3-4 hours)
  - MySQL Database storing results (5 minutes/query)
- 20. Genetic Studies using Spark Image Layer
  Analysis could be completed in several months (instead of 120 years, could now be completed in days in the cloud)
  Data can be freely explored and analyzed
    val bones = sc.loadImages("work/f2_bones/*/bone.tif")
  Segment hard and soft tissues, label cells, export results
    val hardTissue = bones.threshold(OTSU)
    val softTissue = hardTissue.invert
    val cells = hardTissue.componentLabel.
      filter(c=>c.size>100 & c.size1000).map(sampleToPath).joinByKey(bones)
  See immediately in datasets of terabytes which image had the largest cells
  New hypotheses and analyses can be done in seconds / minutes
  Task | Single Core Time | Spark Time (40 cores)
  Load and Preprocess | 360 minutes | 10 minutes
  Single Column Average | 4.6 s | 400 ms
  1 K-means Iteration | 2 minutes | 1 s
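For readers unfamiliar with the threshold / component-label / size-filter sequence used above, here is a small single-machine sketch in plain Scala. It is not the Spark Image Layer implementation: it uses a fixed threshold instead of Otsu's method, a 2D image stored as a position-to-value map, and 4-connectivity. All names are hypothetical.

```scala
object ComponentLabelDemo {
  type Pos = (Int, Int)

  // Foreground = positions whose value is at or above a fixed level
  // (the slide uses Otsu's method to choose the level automatically).
  def threshold(image: Map[Pos, Double], level: Double): Set[Pos] =
    image.filter { case (_, v) => v >= level }.keySet

  // 4-connected neighbors of a pixel.
  def neighbors(p: Pos): Seq[Pos] =
    Seq((p._1 + 1, p._2), (p._1 - 1, p._2), (p._1, p._2 + 1), (p._1, p._2 - 1))

  // Flood fill: take any unlabeled foreground pixel and grow its component.
  def componentLabel(foreground: Set[Pos]): List[Set[Pos]] = {
    @annotation.tailrec
    def grow(frontier: List[Pos], seen: Set[Pos]): Set[Pos] = frontier match {
      case Nil => seen
      case p :: rest =>
        val next = neighbors(p).filter(n => foreground(n) && !seen(n))
        grow(next.toList ++ rest, seen ++ next)
    }

    @annotation.tailrec
    def loop(remaining: Set[Pos], acc: List[Set[Pos]]): List[Set[Pos]] =
      remaining.headOption match {
        case None => acc
        case Some(seed) =>
          val component = grow(List(seed), Set(seed))
          loop(remaining -- component, component :: acc)
      }

    loop(foreground, Nil)
  }
}
```

A size filter in the spirit of the slide is then an ordinary collection operation, for example `componentLabel(fg).filter(c => c.size > 100 && c.size < 1000)`.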
- 22. Science Problems: Full Brain Imaging
  Collaboration with A. Astolfo and A. Patera
  - Measure a full mouse brain (1 cm³) with cellular resolution (1 μm)
  - 10 x 10 x 10 scans at 2560 x 2560 x 2160: 14 TVoxels
  - Figure: 0.000004% of the entire dataset
  14 TVoxels = 56 TB
  - Each scan needs to be registered and aligned together
  - There are no computers with 56 TB of memory
  - Even multithreaded approaches are not feasible and require many logistics
  - Analysis of the stitched data is also of interest (segmentation, vessel analysis, distribution and network connectivity)
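The voxel and byte counts follow directly from the scan geometry. The sketch below assumes 4 bytes per voxel (for example 32-bit floats), which is not stated on the slide but is what the ratio 56 TB / 14 TVoxels implies. The object name is illustrative.

```scala
object BrainDataVolume {
  val scans: Long = 10L * 10L * 10L
  val voxelsPerScan: Long = 2560L * 2560L * 2160L
  val totalVoxels: Long = scans * voxelsPerScan

  // Assumption: 4 bytes per voxel, inferred from 56 TB / 14 TVoxels.
  val bytesPerVoxel: Long = 4L
  val totalBytes: Long = totalVoxels * bytesPerVoxel

  val teraVoxels: Double = totalVoxels / 1e12
  val teraBytes: Double = totalBytes / 1e12

  def main(args: Array[String]): Unit =
    println(f"$teraVoxels%.1f TVoxels, $teraBytes%.1f TB")
}
```

Without rounding, this gives slightly more than 14 TVoxels and slightly more than 56 TB; the slide's figures are the rounded values.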
- 23. Science Problems: Big Stitching
    Images : RDD[((x, y, z), Img[Double])] = [(x, Img), ...]
    dispField = Images.
      cartesian(Images).map{
        case ((xA,ImA), (xB,ImB)) =>
          xcorr(ImA,ImB,in=xB-xA)
      }
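The matching step pairs every tile with every other tile and estimates their relative displacement by cross-correlation. As a rough single-machine illustration, the sketch below does this for 1D tiles with an unnormalized cross-correlation and a brute-force search over integer shifts. Plain Scala collections stand in for the RDD, and `xcorrAt`, `bestShift`, and `pairwiseShifts` are hypothetical names, not the `xcorr` used on the slide.

```scala
object StitchMatchDemo {
  // Unnormalized cross-correlation of a and b with b shifted by `shift` samples.
  def xcorrAt(a: Array[Double], b: Array[Double], shift: Int): Double =
    a.indices.collect {
      case i if i + shift >= 0 && i + shift < b.length => a(i) * b(i + shift)
    }.sum

  // Integer shift in [-maxShift, maxShift] with the highest correlation.
  def bestShift(a: Array[Double], b: Array[Double], maxShift: Int): Int =
    (-maxShift to maxShift).maxBy(s => xcorrAt(a, b, s))

  // All ordered pairs of distinct tiles, similar to cartesian(...) plus a filter.
  def pairwiseShifts(tiles: Map[Int, Array[Double]], maxShift: Int): Map[(Int, Int), Int] =
    (for {
      (idA, a) <- tiles.toSeq
      (idB, b) <- tiles.toSeq
      if idA != idB
    } yield (idA, idB) -> bestShift(a, b, maxShift)).toMap
}
```

A real implementation would normalize the correlation and restrict the search to the expected overlap region, which is presumably what the `in=xB-xA` argument on the slide is for.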
- 24. From Matching to Stitching
  From the updated information provided by the cross correlations and by applying appropriate smoothing criteria (if necessary).
  The stitching itself, rather than rewriting the original data, can be done in a lazy fashion as certain regions of the image are read.
  This also ensures the original data is left unaltered and all analysis is reversible.
    def getView(tPos,tSize) =
      stImgs.
        filter(x=>abs(x-tPos) val oImg = new Image(tSize)
          oImg.copy(img,x,tPos)
        }.addImages(AVG)
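The idea of lazy stitching is that a view is assembled only when it is requested: select the tiles that overlap the requested window, copy the overlapping pixels, and average where tiles overlap. The sketch below shows this for 1D tiles in plain Scala. `Tile`, the integer offsets, and this `getView` signature are assumptions made for illustration, not the slide's API.

```scala
object LazyViewDemo {
  // A stitched 1D tile: global start offset plus its pixel values.
  case class Tile(offset: Int, pixels: Array[Double])

  // Assemble the window [tPos, tPos + tSize) on demand; overlaps are averaged.
  // Positions not covered by any tile are returned as 0.0.
  def getView(tiles: Seq[Tile], tPos: Int, tSize: Int): Array[Double] = {
    val sums = new Array[Double](tSize)
    val counts = new Array[Int](tSize)
    for {
      tile <- tiles
      // Skip tiles that do not intersect the requested window at all.
      if tile.offset < tPos + tSize && tile.offset + tile.pixels.length > tPos
      i <- tile.pixels.indices
      out = tile.offset + i - tPos
      if out >= 0 && out < tSize
    } {
      sums(out) += tile.pixels(i)
      counts(out) += 1
    }
    sums.zip(counts).map { case (s, c) => if (c == 0) 0.0 else s / c }
  }
}
```

Because the tiles themselves are never modified, the original data stays untouched, which matches the reversibility point made on the slide.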
- 25. Viewing Regions
    getView(Pos(26.5,13),Size(2,2))
- 26. Real-time with Spark Streaming: Webcam
  In the biological imaging community, the open source tools of ImageJ2 and Fiji are widely accepted and have a large number of readily available plugins and tools.
  We can integrate the functionality directly into Spark and perform operations on much larger datasets than a single machine could have in memory. Additionally these analyses can be performed on streaming data.
- 27. Streaming Analysis: Real-time Webcam Processing
    val wr = new WebcamReceiver()
    val ssc = sc.toStreaming(strTime)
    val imgList = ssc.receiverStream(wr)
  Filter images
    val filtImgs = allImgs.mapValues(_.run("Median...","radius=3"))
  Create a background image
    val totImgs = inImages.count()
    val bgImage = inImages.reduce(_ add _).multiply(1.0/totImgs)
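The background image above is simply the pixel-wise mean of the frames: reduce with addition, then scale by one over the frame count. A minimal plain-Scala sketch of that computation, with frames represented as flat `Array[Double]` (an assumption; the slide operates on ImageJ images inside Spark), could look like this:

```scala
object BackgroundDemo {
  type Frame = Array[Double]

  // Pixel-wise sum of two frames of equal length.
  def add(a: Frame, b: Frame): Frame = a.zip(b).map { case (x, y) => x + y }

  // Scale every pixel by a constant factor.
  def multiply(a: Frame, factor: Double): Frame = a.map(_ * factor)

  // Mean of all frames: reduce with add, then scale by 1/count.
  // Requires at least one frame.
  def background(frames: Seq[Frame]): Frame =
    multiply(frames.reduce((a, b) => add(a, b)), 1.0 / frames.size)
}
```

Because addition is associative and commutative, the same reduce can be evaluated in parallel across partitions, which is what makes this formulation a natural fit for Spark.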
- 28. Identify Outliers in Streams
  Remove the background image and find the mean value
    val eventImages = filtImgs.
      transform{
        inImages =>
          val corImage = inImages.map {
            case (inTime,inImage) =>
              val corImage = inImage.subtract(bgImage)
              (corImage.getImageStatistics().mean,
                (inTime,corImage))
          }
          corImage
      }
  Show the outliers
    eventImages.filter(iv => Math.abs(iv._1)>20).
      foreachRDD(showResultsStr("outlier",_))
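Stripped of the streaming machinery, the outlier test is: subtract the background from each frame, take the mean of the difference, and keep frames whose absolute mean exceeds a cutoff (20 on the slide). The sketch below expresses that on an ordinary sequence of timestamped frames; the flat-array frame type and the function names are assumptions for illustration.

```scala
object OutlierDemo {
  type Frame = Array[Double]

  // Mean value of the background-subtracted frame.
  def meanDifference(frame: Frame, background: Frame): Double =
    frame.zip(background).map { case (f, b) => f - b }.sum / frame.length

  // Keep (time, meanDifference) for frames deviating by more than `cutoff`.
  def outliers(frames: Seq[(Long, Frame)],
               background: Frame,
               cutoff: Double = 20.0): Seq[(Long, Double)] =
    frames
      .map { case (time, frame) => (time, meanDifference(frame, background)) }
      .filter { case (_, mean) => math.abs(mean) > cutoff }
}
```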
- 29. Streaming Demo with Webcam
- 30. As a scientist (not a data-scientist)
  Apache Spark is a brilliant platform and, utilizing GraphX, MLlib, and other packages, there are unlimited possibilities.
  - Scala can be a beautiful but not easy language
  - Python is an easier language
  Both suffer from
  - Non-obvious workflows
  - Scripts depending on scripts depending on scripts (can be very fragile)
  - Although all analyses can be expressed as a workflow, this is often difficult to see from the code
  - Non-technical persons have little ability to understand or make minor adjustments to analysis
  - Parameters require recompiling to change, or GUIs need to be placed on top
- 31. A basic image filtering operation
  - Thanks to Spark, it is cached, in memory, approximate, cloud-ready
  - Thanks to Map-Reduce it is fault-tolerant, parallel, distributed
  - Thanks to Java, it is hardware agnostic
  But it is also not really so readable
    def spread_voxels(pvec: ((Int,Int),Double), windSize: Int = 1) = {
      val wind=(-windSize to windSize)
      val pos=pvec._1
      val scalevalue=pvec._2/(wind.length*wind.length)
      for(x 1.0)
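The code on this slide is cut off in the transcript, so the remainder of the author's implementation is not recoverable here. As an independent illustration of the general pattern it starts (spread each voxel's value over a window, then sum the contributions per position), the following plain-Scala sketch implements a box filter with `flatMap` and `groupBy`. It is one plausible completion under that assumption, not the original code, and `spreadVoxels` and `boxFilter` are hypothetical names.

```scala
object SpreadVoxelsDemo {
  type Voxel = ((Int, Int), Double)

  // Spread one voxel's value evenly over its (2*windSize+1)^2 neighborhood.
  def spreadVoxels(pvec: Voxel, windSize: Int = 1): Seq[Voxel] = {
    val wind = -windSize to windSize
    val ((px, py), value) = pvec
    val scaleValue = value / (wind.length * wind.length)
    for (dx <- wind; dy <- wind) yield ((px + dx, py + dy), scaleValue)
  }

  // Box filter: spread every voxel, then sum what lands on each position.
  def boxFilter(image: Seq[Voxel], windSize: Int = 1): Map[(Int, Int), Double] =
    image
      .flatMap(v => spreadVoxels(v, windSize))
      .groupBy(_._1)
      .map { case (pos, parts) => pos -> parts.map(_._2).sum }
}
```

On an RDD of voxels the same shape would typically be written as a `flatMap` followed by `reduceByKey(_ + _)`, so that the summation happens in a distributed fashion.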
- 40. Common Features
  The distance, while illustrative, is not a commonly used feature; more common are various filters applied to the image:
  - Gaussian Filter (information on the values of the surrounding pixels)
  - Sobel / Canny Edge Detection (information on edges in the vicinity)
  - Entropy (information on variability in the vicinity)
  x | y | Intensity | Sobel | Gaussian
  1 | 1 | 0.94 | 0.32 | 0.53
  1 | 10 | 0.48 | 0.50 | 0.45
  1 | 11 | 0.50 | 0.50 | 0.46
  1 | 12 | 0.48 | 0.64 | 0.46
  1 | 13 | 0.43 | 0.78 | 0.45
  1 | 14 | 0.33 | 0.94 | 0.42
- 41. Analyzing the feature vector
  The distributions of the features appear very different and can thus likely be used for identifying different parts of the images.
  Combine this with our a priori information (called supervised analysis)
- 42. Using Machine Learning
  Now that the images are stored as feature vectors, they can be easily analyzed with standard Machine Learning tools. It is also much easier to combine with training information.
  x | y | Absorb | Scatter | Training
  700 | 4 | 0.3706262 | 0.9683849 | 0.0100140
  704 | 4 | 0.3694059 | 0.9648784 | 0.0100140
  692 | 8 | 0.3706371 | 0.9047878 | 0.0183156
  696 | 8 | 0.3712537 | 0.9341989 | 0.0334994
  700 | 8 | 0.3666887 | 0.9826912 | 0.0453049
  704 | 8 | 0.3686623 | 0.8728824 | 0.0453049
  Want to predict Training from x, y, Absorb, and Scatter
  MLlib: Logistic Regression, Random Forest, K-Nearest Neighbors, ...
- 43. Beyond Image Processing
  For many datasets, processing, segmentation, and morphological analysis provide all the information that needs to be extracted. For many systems like bone tissue, cellular tissues, cellular materials and many others, the structure is just the beginning and the most interesting results come from the application of physical, chemical, or biological rules inside of these structures.
  $\sum_j \vec{F}_{ij} = m \ddot{\vec{x}}_i$
  Such systems can be easily represented by a graph, and analyzed using GraphX in a distributed, fault tolerant manner.
- 44. Hadoop Filesystem (HDFS not HDF5)
  - The bottleneck is the filesystem connection: many nodes (10+) reading in parallel bring even a GPFS-based InfiniBand system to a crawl
  - One of the central tenets of MapReduce is data-centric computation: instead of moving data to the computation, move the computation to the data.
  - Use fast local storage for storing everything redundantly: less transfer and fault tolerance
  - Largest file size: 512 yottabytes; Yahoo has a 14 petabyte filesystem in use