Using big data in practice: Computational infrastructures

Gianluca Demartini, Information School, University of Sheffield


Big Data

•  Defined as Vs
   – Volume: Just about size: Giga-, Tera-, Petabytes
   – Variety: Formats: text, databases, pictures, Excel
   – Velocity: Speed: 10,000 tweets per second, 2,000 pictures on Instagram per second


Data is huge (Volume)

•  Facebook
   – processes 750 TB/day of data
   – adds 7 PB of photo storage/month
•  This requires computers (a lot of them)
•  Not only internet companies!
   – Banks, package delivery, governments, shops, etc.


Data is fast (Velocity)

•  Twitter firehose
   – In 2011, 1,000 Tweets per second (TPS)
   – In 2014, 20,000 TPS
   – With peaks: 143K TPS
•  Services on top
   – DataSift: aggregate, filter and extract insights
•  Not only internet companies!
   – Stock exchange, sensors in water network, etc.


Scale-up vs Scale-out

•  Scale-up
   – Increasing the power of your computer (i.e., disk, memory, processor)
•  Scale-out
   – Use many standard computers and distribute data and computation over them
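To make the scale-out idea concrete, here is a minimal Python sketch that spreads records over several "nodes" by hashing their keys, lets each node compute a partial result on its own share, and then combines the partial results. The names `partition` and `local_sum`, the sample records, and the choice of three nodes are assumptions made for illustration; a real cluster would run each node's work on a separate machine.

```python
from zlib import crc32


def partition(records, num_nodes):
    """Assign each (key, value) record to one of num_nodes buckets by hashing its key."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in records:
        # crc32 is deterministic across runs, unlike the built-in hash() for strings.
        node_index = crc32(key.encode("utf-8")) % num_nodes
        nodes[node_index].append((key, value))
    return nodes


def local_sum(node_records):
    """Work done independently on one node: sum the values it holds."""
    return sum(value for _, value in node_records)


# Hypothetical sample data.
records = [("alice", 3), ("bob", 5), ("carol", 7), ("dave", 11)]

nodes = partition(records, num_nodes=3)
partials = [local_sum(node) for node in nodes]  # one partial result per node
total = sum(partials)                           # combine the partial results
```

The total is the same however the records are distributed, which is what allows more machines to be added without changing the program's logic.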


Facebook Data Center (Sweden)


Data


Fundamental work

•  Google File System, 2003
   – access to data using large clusters of commodity machines
•  BigTable, 2003-2006
   – data storage system
   – Distributed map Key -> Value
•  Map/Reduce, 2004
   – Programming paradigm over a cluster of machines


Open-source analogues

•  HDFS (Hadoop Distributed File System)
   – Distributed file system
•  Apache HBase http://hbase.apache.org/
   – Distributed database
•  Apache Hadoop http://hadoop.apache.org/
   – Distributed computation

C. L. Philip Chen, Chun-Yang Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on Big Data, Information Sciences, Volume 275, 10 August 2014, Pages 314-347, ISSN 0020-0255, http://dx.doi.org/10.1016/j.ins.2014.01.015. (http://www.sciencedirect.com/science/article/pii/S0020025514000346)


Hadoop Distributed File System (HDFS)

•  Inspired by Google File System
•  Scalable, distributed, portable file system written in Java for the Hadoop framework
•  Primary distributed storage used by Hadoop applications
•  HDFS can be part of a Hadoop cluster or can be a stand-alone general-purpose distributed file system
•  Reliability and fault tolerance ensured by replicating data across multiple hosts
•  ZooKeeper for the distributed coordination
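The replication bullet above can be illustrated with a small Python sketch: every block of a file is copied to several distinct hosts, so each block stays readable when a host fails. The round-robin placement, the host names, and the helper names `place_blocks` and `readable_blocks` are assumptions for illustration only; this is not HDFS's actual placement policy. The replication factor of 3 matches the HDFS default.

```python
def place_blocks(num_blocks, hosts, replication=3):
    """Map each block id to `replication` distinct hosts (simple round-robin)."""
    if replication > len(hosts):
        raise ValueError("need at least as many hosts as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            hosts[(block + offset) % len(hosts)] for offset in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_hosts):
    """Return the ids of blocks that still have at least one live replica."""
    return {
        block
        for block, replicas in placement.items()
        if any(host not in failed_hosts for host in replicas)
    }


# Hypothetical four-host cluster storing a six-block file.
hosts = ["host-a", "host-b", "host-c", "host-d"]
placement = place_blocks(num_blocks=6, hosts=hosts, replication=3)

# Losing one host (or even two) leaves every block readable.
survivors = readable_blocks(placement, failed_hosts={"host-b"})
```

With three replicas on distinct hosts, a block only becomes unreadable if all three of its hosts fail at once.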


MapReduce (MR)

•  High-level programming model and implementation for large-scale parallel data processing
•  Commodity hardware
•  Fault-tolerant
•  Currently, the most overhyped system in CS


MR Data Model

•  Files!
•  Each file is a set of (key, value) pairs
•  A map-reduce program:
   – Input: a set of (input key, value) pairs
   – Output: a set of (output key, value) pairs

k1 -> v1
k2 -> v2
k3 -> v3

FirstName -> John
LastName -> Smith
Role -> VP
Steps -> 12k
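As a small illustration of this data model, the Python sketch below reads a file written as one `key -> value` line per pair into a list of (key, value) tuples. The on-disk layout and the helper name `parse_pairs` are assumptions; the slide only states that a file is a set of pairs.

```python
def parse_pairs(text):
    """Turn lines of the form 'key -> value' into a list of (key, value) tuples."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, value = line.split("->", 1)
        pairs.append((key.strip(), value.strip()))
    return pairs


# The example record from the slide, written one pair per line.
file_contents = """
FirstName -> John
LastName -> Smith
Role -> VP
Steps -> 12k
"""

pairs = parse_pairs(file_contents)
```

A map-reduce program then consumes a collection of such pairs and produces another collection of pairs.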


Step 1: the MAP Phase

•  User provides the MAP function:
   – Input: one (input key, value)
   – Output: a set of (intermediate key, value) pairs
•  System applies map function in parallel to all input pairs
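The MAP phase can be sketched in Python as follows. The user-supplied `map_fn` here emits one `(word, 1)` pair per word, and `run_map_phase` plays the role of the system by applying it to all input pairs in parallel. Both names, the word-count task, and the use of a local thread pool in place of a cluster are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def map_fn(key, value):
    """User-provided MAP: key is a line number (unused), value is the line text."""
    return [(word.lower(), 1) for word in value.split()]


def run_map_phase(input_pairs, max_workers=4):
    """'System' side: apply map_fn to every input pair in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in input order.
        results = pool.map(lambda pair: map_fn(*pair), input_pairs)
        # Flatten the per-input lists into one list of intermediate pairs.
        return [pair for emitted in results for pair in emitted]


input_pairs = [(0, "big data is big"), (1, "data is fast")]
intermediate = run_map_phase(input_pairs)
```

Because each call to `map_fn` looks at a single input pair and nothing else, the calls can run in any order and on any machine.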


Step 2: the REDUCE Phase

•  User provides the REDUCE function:
   – Input: intermediate key, and set of values
   – Output: set of output values
•  System groups all pairs with same key and passes values to the REDUCE function
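A matching sketch of the REDUCE phase, again in Python: the system groups intermediate pairs by key and hands each key with its list of values to the user-supplied `reduce_fn`. The function names and the summing reducer are assumptions chosen for illustration.

```python
from collections import defaultdict


def reduce_fn(key, values):
    """User-provided REDUCE: return the set of output values for one key."""
    return [sum(values)]


def run_reduce_phase(intermediate_pairs):
    """'System' side: group pairs by key, then call reduce_fn once per key."""
    groups = defaultdict(list)
    for key, value in intermediate_pairs:
        groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


# Hypothetical intermediate pairs, as a word-count MAP function might emit them.
intermediate = [
    ("big", 1), ("data", 1), ("is", 1), ("big", 1),
    ("data", 1), ("is", 1), ("fast", 1),
]
output = run_reduce_phase(intermediate)
```

Here `output` maps each word to a one-element list holding its count, mirroring the slide's "set of output values" per key.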


MR Job

•  There is one master node
•  Master partitions input file into M splits, by key
•  Master assigns workers (= servers) to the M map tasks, keeps track of their progress
•  Workers write their output to local disk, partitioned into R regions
•  Master assigns workers to the R reduce tasks
•  Reduce workers read regions from the map workers' local disks
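The job flow above can be simulated on a single machine. In this Python sketch the "master" cuts the input into M splits, each map worker writes its output into R regions chosen by hashing the intermediate key, and reduce worker r reads region r from every map worker. The round-robin split, the crc32 hash, and all function names are simplifying assumptions; a real implementation also handles scheduling, progress tracking, and failures.

```python
from collections import defaultdict
from zlib import crc32


def region_of(key, num_regions):
    """Pick the region (and so the reduce task) responsible for a key."""
    return crc32(str(key).encode("utf-8")) % num_regions


def run_job(input_pairs, map_fn, reduce_fn, num_splits, num_regions):
    # Master: partition the input into M splits (round-robin for simplicity).
    splits = [input_pairs[i::num_splits] for i in range(num_splits)]

    # Map workers: each one writes R regions to its own "local disk".
    local_disks = []
    for split in splits:
        regions = [[] for _ in range(num_regions)]
        for key, value in split:
            for out_key, out_value in map_fn(key, value):
                regions[region_of(out_key, num_regions)].append((out_key, out_value))
        local_disks.append(regions)

    # Reduce workers: worker r reads region r from every map worker's disk.
    output = {}
    for r in range(num_regions):
        groups = defaultdict(list)
        for regions in local_disks:
            for key, value in regions[r]:
                groups[key].append(value)
        for key, values in groups.items():
            output[key] = reduce_fn(key, values)
    return output


def word_map(line_number, line):
    return [(word, 1) for word in line.split()]


def count_reduce(word, counts):
    return sum(counts)


lines = [(0, "big data is big"), (1, "data is fast")]
counts = run_job(lines, word_map, count_reduce, num_splits=2, num_regions=3)
```

Hashing the key guarantees that all pairs with the same key land in the same region, so each key is reduced exactly once whatever values of M and R are chosen.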


Example: MR word length count
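The figures for this example did not survive in the transcript, so the following Python sketch is one plausible reading of it: count how many words of each length occur in the input. MAP emits `(length, 1)` for every word and REDUCE sums the ones per length. The function names and the sample lines are assumptions.

```python
from collections import defaultdict


def map_fn(line_number, line):
    """MAP: emit (word length, 1) for every word on the line."""
    return [(len(word), 1) for word in line.split()]


def reduce_fn(length, counts):
    """REDUCE: total number of words that have this length."""
    return sum(counts)


def map_reduce(input_pairs):
    """Run MAP on every input pair, group by key, then run REDUCE per key."""
    groups = defaultdict(list)
    for key, value in input_pairs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


lines = ["big data is big", "data is fast"]
result = map_reduce(list(enumerate(lines)))
print(result)  # {2: 2, 3: 2, 4: 3}
```

The sample input holds two 2-letter words, two 3-letter words, and three 4-letter words, which is what the result reports.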


MapReduce (MR) tools

•  MR implementation
•  MR Query Language
•  MR Query Engine


Pig Latin

•  High-level language
•  SQL-like
•  Very large datasets, translated into Map and Reduce tasks

O'Reilly® Programming Pig, by Alan Gates


Commercial products

•  Hadoop as catch-all big-data solution
   – Small clusters (10s of machines)
   – Massive in-house clusters
   – Public Cloud
•  Hortonworks
   – Consultancy company for Apache Hadoop
•  Cloudera
•  HP Vertica Analytics Platform


Stream processing Big Data tools

•  Batch vs Stream processing
   – Batch: data is large but always available on disk
   – Stream: data is arriving fast and cannot be stored / needs to be processed immediately
•  Streams of data
   – Twitter
   – Internet of Things
   – Smart Cities
   – Nuclear power plant
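The difference between the two processing styles can be sketched in Python. The batch function needs the complete dataset before it can answer, while the streaming function handles one reading at a time and only keeps a small sliding window in memory. The sensor values, the window size, and the function names are assumptions for illustration.

```python
from collections import deque


def batch_mean(values):
    """Batch: the whole dataset is available, so process it in one pass."""
    return sum(values) / len(values)


def sliding_window_mean(stream, window_size):
    """Stream: yield an updated mean per reading, keeping only the last few readings."""
    window = deque(maxlen=window_size)  # older readings are dropped automatically
    for reading in stream:
        window.append(reading)
        yield sum(window) / len(window)


def sensor_readings():
    """Stand-in for an unbounded source such as a sensor feed."""
    for reading in [10.0, 12.0, 11.0, 50.0, 13.0]:
        yield reading


stored = list(sensor_readings())
overall = batch_mean(stored)                                        # needs all the data
means = list(sliding_window_mean(sensor_readings(), window_size=3))  # one result per reading
```

The streaming version uses memory proportional to the window size rather than to the amount of data seen, which is what makes it usable when the data cannot be stored.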


Apache Spark

•  https://spark.apache.org/, databricks.com
•  General system for large-scale data processing
•  Faster (in-memory), interactive
•  Version 1.6.1 released in March 2016


Conclusion

•  If it fits in memory, it's not big data
•  If it's big data, you cannot use Excel/SPSS/R
•  The way to go is scale-out and distributed computation
•  Map/Reduce for batch processing
•  Hadoop or Spark as out-of-the-box systems
   – Also with high-level query interfaces like Pig/R/Python