Big Data Analysis Services in the Cloud - Knowledge Grid · Big Data Analysis Services in the Cloud...
Transcript of Big Data Analysis Services in the Cloud - Knowledge Grid · Big Data Analysis Services in the Cloud...
Big Data Analysis Services in the Cloud
Domenico Talia
DIMES, Università della Calabria & DtoK Lab
Italy
2
§ BigproblemsandBigdata
§ UsingCloudsfordatamining
§ Acollec7onofservicesforscalabledataanalysis§ Dataminingworkflows
§ DataMiningCloudFramework(DMCF)
§ JS4Cloudforprogrammingservice-orientedworkflows.
§ Finalremarks
Talkoutline
3
§ KnowledgeDiscoverymethodsanddataminingtechniquesareusedineveryapplica7ondomaintoextractusefulknowledgefromBigData.
§ DataMiningapplica7onsrangefrom§ Single-taskapplica7ons§ Parameter-sweepingapplica7ons/regularparallelapplica7ons§ Complexapplica7ons(Workflow-based,distributed,parallel).
§ CloudCompu3ngcanbeusedtoprovidedevelopersandend-userswithcompu7ngandstorageservicesandscalableexecu7onmechanismsneededtoefficientlyrunalltheseclassesofapplica3ons.
Goals(1)
4
§ DiscussCloudservicesforscalableexecu7onofdataanalysisworkflows.
§ Presentaprogrammingenvironmentfordataanalysis:DataMiningCloudFramework(DMCF).
§ IntroduceavisualprogramminginterfaceVL4Cloudandthescript-basedJS4Cloudlanguageforimplemen7ngservice-orientedworkflows.
§ EvaluatetheperformanceofdataminingworkflowsonDMCF.
§ Outlinesomeopenresearchissues.
Goals(2)
6
BigDataneedsscalableanalysis
• Vo lume i s on l y onedimensionoftheproblem.
• The most important issueisValue.
• Scalabledata analysis is akey technology to obtainValuefromBigData.
7
BigDataneedsscalableanalysis
Combina7onof• B i g d a t a a n a l i s y s a n dk n o w l e d g e - d i s c o v e r ytechniqueswith
• scalablecompu7ngsystems for
• aneffec3vestrategytoobtainnew insights in a shorterperiodof7me.
• Cloudcompu3nghelps!
9
§ Knowledgediscovery(KDD)anddatamining(DM)apps:§ Needtoruncompute-anddata-intensiveprocesses/tasks§ AreoTenbasedondistribu3onofdata,algorithms,andusers.
§ PaaS(Pla&ormasaService)canbeanappropriatemodeltobuildframeworksthatallowuserstodesignandexecutescalabledataminingapplica7ons.
§ SaaS(So1wareasaService)canbeanappropriatemodeltoimplementscalabledataminingapplica7ons.
§ Thosetwocloudservicemodelscanbeeffec7velyexploitedfordeliveringdataanalysistoolsandapplica3onsasservices.
Dataanalysisasaservice
Servicesfordistributeddatamining
10
§ Byexploi7ngtheSOAmodelitispossibletodefinebasicservicesforsuppor3ngdistributeddataminingtasks/applica3ons.
§ Thoseservicescanaddressalltheaspectsofdataminingandinknowledgediscoveryprocesses• dataselec3onandtransportservices,
• dataanalysisservices,
• knowledgemodelsrepresenta3onservices,and
• knowledgevisualiza3onservices.
Servicesfordistributeddatamining
11
§ Dataminingtasksandapplica7onscanbeofferedashigh-levelservices.
§ AnewwaytodeliverydataanalysissoTwareis
dataanalysisasaservice(DAaaS)
Dataminingasaservice
12
§ Itispossibletodesignservicescorrespondingto
Single KDD Steps
All steps that compose a KDD process such as preprocessing, filtering, and visualization are expressed as services.
Single Data Mining Tasks
Here are included tasks such as classification, clustering, and association rules discovery.
Distributed Data Mining Patterns
This level implements, as services, patterns such as collective learning, parallel classification and meta-learning models.
Data Mining Applications and KDD processes This level includes the previous tasks and patterns composed in multi-step workflows.
Servicesfordistributeddatamining
13
§ Thiscollec7onofdataminingservicesimplementsan
OpenServiceFrameworkforDistributedDataMining
Data Mining Task Services
Distributed Data Mining patterns
Distributed Data Mining Applications and KDD
processes
KDD Step Services
Servicesfordistributeddatamining
14
§ Itallowsdevelopers toprogramdistributedKDDprocessesasacomposi3onofsingleand/oraggregatedservicesavailableoveraservice-orientedinfrastructure.
§ ThoseservicesshouldexploitotherbasicCloudservicesfordatatransfer,replicamanagement,dataintegra7onandquerying.
A Service-oriented Cloud workflow
Servicesfordistributeddatamining
15
§ By exploi7ng the Cloud services features it is possible todevelop data mining services accessible every 3me andeverywhere(remotelyandfromsmalldevices).
§ This approach can produce not only service-based distributeddataminingapplica7ons,butalso
§ Dataminingservicesforcommuni3es/virtualorganiza7ons.
§ Distributeddataanalysisservicesondemand.
§ A sort of knowledge discovery eco-system made by a large
numbersofdecentralizeddataanalysisservices.
DataanalysisonClouds:Systems
[16]
§ Spark-anopen-sourceframeworkforin-memorydataanalysisandmachinelearningdevelopedatUCBerkeley.
§ Sphere/Sector-acomputeservicebuiltontopofSectortoprovideasetofprogramminginterfacesfordistributeddataanalysisapplica7ons.
§ DMCF–theDataMiningCloudFrameworksuppor7ngCloud-baseddataanalysisappsasvisualandscript-basedworkflows.
§ SwiS/T–aworkflow-basedsystemusingfunc7onaldata-driventaskparallelismforscalingdata-intensiveapps.
§ Others:Mahout,HPC-ABDS,CloudFlows,...&commercialsystems.
TheDataMiningCloudFramework
§ TheDataMiningCloudFrameworksupportsworkflow-basedKDD applica;ons, expressed (visually and by a scriptlanguage) as a graphs that link together data sources, dataminingalgorithms,andvisualiza7ontools.
18
CoverType
Sample1
Sample3
Sample2
J48
J48
J48
Sample4
Model1
Model2
Model3 Splitter Voter Model
train
train
train
T2
T3
T4 T5 T1
Architecturecomponents
§ Computeisthecomputa7onalenvironmenttoexecuteCloudapplica7ons:§ Webrole:Web-basedapplica7ons.§ Workerrole:batchapplica7ons.§ VMrole:virtualmachineimages.
§ Storageprovidesscalablestorageelements:§ Blobs:storingbinaryandtextdata.§ Tables:non-rela7onaldatabases.§ Queues:communica7onbetweencomponents.
§ Fabriccontrollerlinksthephysicalmachinesofasingledatacenter:§ ComputeandStorageservicesarebuiltontopofthiscomponent.
19
Compute
Cloud Platform
Website
Web Role instances
Worker
Worker Role instances
Queues
Task Queue
Tables
Task Status Table Storage
Fabric
Blobs
Input datasets
Data mining models
TheDataMiningCloudFramework:Architecture
20
Script-basedworkflows
§ WeextendedDMCFaddingascript-baseddataanalysisprogrammingmodelasamoreflexibleprogramminginterface.
§ Script-basedworkflowsareaneffec7vealterna7vetographicalprogramming.
§ Ascriptlanguageallowsexpertstoprogramcomplexapplica7onsmorerapidly,inamoreconcisewayandwithhigherflexibility.
§ Theideaistoofferascript-baseddataanalysislanguageasanaddi3onalandmoreflexibleprogramminginterfacetoskilledusers.
24
TheJS4Cloudscriptlanguage
§ JS4Cloud(JavaScriptforCloud)isalanguageforprogrammingdataanalysisworkflows.
§ MainbenefitsofJS4Cloud:§ itisbasedonawellknownscrip7nglanguage,sousersdonothaveto
learnanewlanguagefromscratch;
§ itimplementsadata-driventaskparallelismthatautoma7callyspawnsready-to-runtaskstotheavailableCloudresources;
§ itexploitsimplicitparallelismsoapplica7onworkflowscanbeprogrammedinatotallysequen7alway(nouserdu?esforworkpar??oning,synchroniza?onandcommunica?on).
25
JS4Cloudfunc3ons
JS4Cloudimplementsthreeaddi7onalfunc7onali7es,implementedbythesetoffunc7ons:§ Data.get,foraccessingoneoracollec7onofdatasetsstoredintheCloud;§ Data.define, fordefiningnewdataelementsthatwillbecreatedat
run7measaresultofatoolexecu7on;§ Tool, toinvoketheexecu7onofasoTwaretoolavailableintheCloudasa
service.
26
Script-basedapplica3ons
§ Code-definedworkflowsarefullyequivalenttographically-definedones:
Data.define Data.define
Data.define
Data.define
JS4CloudpaXerns
var DRef = Data.get("Customers");var nc = 5;var MRef = Data.define("ClustModel");K-Means({dataset:DRef, numClusters:nc, model:MRef});
Single task
28
JS4CloudpaXerns
Pipeline
var DRef = Data.get("Census");var SDRef = Data.define("SCensus");Sampler({input:DRef, percent:0.25, output:SDRef});var MRef = Data.define("CensusTree");J48({dataset:SDRef, confidence:0.1, model:MRef});
29
JS4CloudpaXerns
Data partitioning
var DRef = Data.get("CovType");var TrRef = Data.define("CovTypeTrain");var TeRef = Data.define("CovTypeTest");PartitionerTT({dataset:DRef, percTrain:0.70, trainSet:TrRef, testSet:TeRef});
30
JS4CloudpaXerns
var DRef = Data.get("NetLog");var PRef = Data.define("NetLogParts", 16);Partitioner({dataset:DRef, datasetParts:PRef});
Data partitioning
31
JS4CloudpaXerns
var M1Ref = Data.get("Model1");var M2Ref = Data.get("Model2");var M3Ref = Data.get("Model3");var BMRef = Data.define("BestModel");ModelChooser({model1:M1Ref, model2:M2Ref, model3:M3Ref, bestModel:BMRef});
Data aggregation
32
JS4CloudpaXerns
var MsRef = Data.get(new RegExp("^Model"));var BMRef = Data.define("BestModel");ModelChooser({models:MsRef, bestModel:BMRef});
Data aggregation
33
JS4CloudpaXerns
var TRef = Data.get("TrainSet");var nMod = 5;var MRef = Data.define("Model", nMod);var min = 0.1;var max = 0.5;for(var i=0; i<nMod; i++) J48({dataset:TRef, model:MRef[i], confidence:(min+i*(max-min)/(nMod-1))});
Parameter sweeping
34
JS4CloudpaXerns
var nMod = 16;var MRef = Data.define("Model", nMod);for(var i=0; i<nMod; i++) J48({dataset:TsRef[i], model:MRef[i], confidence:0.1});
Input sweeping
35
Monitoringinterface
§ Asnapshotoftheapplica7onduringitsexecu7onmonitoredthroughtheprogramminginterface.
37
Exampleapplica3ons(1)
Finance: Predic7on of personalincomebasedoncensusdata
Networks: Discovery ofnetwork a]acks from loganalysis.
E-Health: Disease classifica7onbasedongeneanalysis
38
Exampleapplica3ons(2)
Smart City: Car trajectory pa]erndetec7onapplica7ons.
Biosciences:drugmetabolismassocia7onsinpharmacogenomics.
39
KDDCup99example
• Input dataset: 46 million tuples• Used Cloud: up to 64 virtual servers (single-core
1.66 GHz CPU, 1.75 GB of memory, and 225 GB of disk)
40
Anotherapplica3onexample
• Ensemblelearningworkflow(geneanalysisforclassifyingcancertypes)
• Turnaround7me:162minuteson1server,11minuteson19servers.• Speedup:14.8
43
TrajectorypaXerndetec3on
§ Analyzetrajectoriesofmobileuserstodiscovermovementpa]ernsandrules.
§ Aworkflowthatintegratesfrequentregions
detec7on,trajectorydatasynthe7za7onandtrajectorypa]ernextrac7on.
44
45
FrequentRegionsDetec3on• Detectareasmoredenselypassedthrough• Density-based clustering algorithm (DB-
Scan)• Furtheranalysis:movementthroughareas
TrajectoryDataSynthe3za3on• each point is subs7tuted by the dense region it
belongsto.• trajectory representa7ons is changed from
movements between points into movementsbetweenfrequentregions
TrajectoryPaXernExtrac3on• Discovery of pa]erns from structured
trajectories• T-Apriori algorithm, i.e. ad-hoc modified
versionofApriori
Applica3onmainsteps
47
Workflowimplementa3on
§ DMCFvisualworkflowimplemen7ngthetrajectorypa]erndetec7onalgorithm§ Eachnoderepresentseitheradatasourceoradataminingtool§ Eachedgerepresentsanexecu7ondependencyamongnodes§ Somenodesarelabeledbythearraynota7on
§ Compactwaytorepresentmul7pleinstancesofthesamedatasetortool§ Veryusfeultobuildcomplexworkflows(data/taskparallelism,parametersweeping,etc.)
128paralleltasks!
49
n vsseveraldatasizes(upto1287mestamps),fordifferentnumberofservers
w itpropor7onallyincreaseswiththetheinputsize
w itpropor7onallydecreaseswiththeincreaseofcompu7ngresources
n vsthenumberofservers(upto64),fordifferentdatasizes
w comparisonparallel\sequen7alexecu7on
w D16(D128):itreducesfrom8.3(68)hourstoabout0.5(1.4)hours
Experimentalevalua3on
Turnaround3me
68 hours (3 days)
1.4 hours
49
50
n speed-up
w notabletrend,uptothecaseof16nodes
w goodtrendforhighernumberofnodes(influenceofthesequen7alsteps)
n scale-up
w comparable7meswhendatasizeand
#serversincreasepropor7onallyw DBSCANstep(parallel)takesmostofthe
total7mew othersteps(sequen7al)increaseswith
largerdatasets
Experimentalevalua3on
Scalabilityindicators
14.5
49.3
50
Finalremarks
§ Dataminingandknowledgediscoverytoolsareneededtosupportfindingwhatisinteres3ngandvaluableinBigData.
§ Cloudcompu7ngsystemscaneffec7velybeusedasscalableplaiormsforservice-orienteddatamining.
§ Designandprogrammingtoolsareneededforsimplicityandscalabilityofcomplexdataanalysisprocesses.
§ TheDMCFanditsprogramminginterfacessupportusersinimplemen7ngandrunningscalabledatamining.
52
Bigisquiteamovingtarget?
53
ü Yes,butnotonly for the increasingsizeofdata.
ü Bigisamisleadingterm.
ü It must include complexity (anddifficulty)ofhandlinghugeamountsofheterogeneousdata.
ü Inthefirstreport(2001)onBigDatathetermBigwasnotactuallyused.
Bigisquiteamovingtarget?Someexample
54
§ Somedatachallengesexampleswefacetoday§ Scien7ficdataproducedatarateofhundredsofgigabits-per-secondthatmustbestored,filteredandanalyzed.
§ Tenmillionsofimagesperdaythatmustbemined(analyzed)inparallel.
§ Onebillionoftweets/postsqueriedinreal-7meonanin-memorydatabase.
Arethechallengesfacedtodaydifferentfromthechallengesfaced10yearsago?
55
§ Yes, because data sources aremuchmorethan10yearsago.
§ Yes, because we want to solvemorecomplexproblems.
§ No,becauseindataminingwes7llwork on data produced withdifferentgoals.
§ No, because computers wereinventedtoprocessdataquickly.
IsSize/Volumethemostimportantissueindataanalysis?
56
§ No , volume is only onedimensionoftheproblem.
§ The most important issue isValue.
§ Size and complexity representthe problem,Value is the realbenefit.
Smartalgorithmsandscalablesystems
62
• HPCsystemsandCloudsequippedwithdataanalysistoolsarebecomingthemostused(anduseful)plaiormsforBigDataanalysis.
• Newwaystoefficientlycomposedifferentdistributeddataminingmodelsandtoolsareneededfornewplaiorms.
• MAINISSUES:Dataminingalgorithms,toolsandapplica7onsmustbedesignedandportedonsuchplaiormsfordevelopingextremedatadiscoverysolu7ons.
ScalableDataanalysis:OpenResearchIssues
63
• Programmingabstractsforbigdataanaly?cs.TheMapReduceandtheworkflowmodelsareoTenusedonHPCandclouds,butmoreresearchworkisneededtodevelopotherscalable,adap7ve,general,higher-levelabstractprogrammingstructures&tools.
• Dataandtoolintegra?onandopenness.Codecoordina7onanddataintegra7onaremainissuesinlarge-scaleapplica7onsthatusedataandcompu7ngresources.Standardformats,dataexchangemodelsandcommonAPIsareneeded.
• Interoperabilityofbigdataanaly?csframeworks.Cloudserviceparadigmsmustbedesignedtoallowworldwidefedera7onandintegra7onofmul7pledataanaly7csframeworksandservices.
Scalabledataanalysis:OpenResearchIssues
64
• Exascalecompu?ngsystemsrepresentthenextcompu?ngstepalso(inpar7cular)intheBigDatafield.
• Thosesystemsraiseveryimportantresearchchallengesunderinves7ga7on
withthegoalof
• Buildingscalablehw/swsystemscomposedofaverylargenumberofmul7-coreprocessorsexpectedtodeliveraperformanceof1018opera7onspersecond.
… the potential interoperability and scaling convergence of HPC computing and data analysis is crucial to the future.
D.A. Reed & J. Dongarra, CACM 2015
Ongoing&futurework
§ DtoKLabisastartupthatoriginatedfromourworkinthisarea.
www.dtoklab.com
§ Nuby3csisdeliveredonpubliccloudsasahigh-performanceSoTware-as-a-Service(SaaS)tosupportinnova7vedataanalysistoolsandapplica7ons.
§ Applica7onsintheareaofsocialdataanalysis,urbancompu3ng,airtrafficandothershavebeendevelopedbyJS4Cloud.
65
Somepublica3ons
§ D.Talia,P.Trunfio,F.Marozzo,DataAnalysisintheCloud,Elsevier,USA,2015.
§ D. Talia, P. Trunfio,Service-oriented distributedknowledgediscovery,CRCPress,USA,2012.
§ D.Talia,“CloudsforScalableBigDataAnaly3cs”,IEEEComputer,46(5),pp.98-101,2013
§ F. Marozzo, D. Talia, P. Trunfio, "A WorkflowManagement System for Scalable Data Miningon Clouds”, IEEE Transac?ons on ServiceCompu?ng,2016(toappear).
§ L. Belcastro, F. Marozzo, D. Talia, P. Trunfio,"UsingScalableDataMiningforPredic3ngFlightDelays”, ACM Transac?ons on IntelligentSystemsandTechnology(TIST),vol.8no.1,July2016.
66