Hadoop Ecosystem Overview - Inspiring...
Agenda
• Introduce Hadoop projects to prepare you for your group work
  – Intimate detail will be provided in future lectures
• Discuss potential use cases for each project
Topics
• HDFS
• MapReduce
• YARN
• Sqoop
• Flume
• NiFi
• Pig
• Hive
• Streaming
• HBase
• Accumulo
• Avro
• Parquet
• Mahout
• Oozie
• Storm
• ZooKeeper
• Spark
• SQL-on-Hadoop
• In-Memory Stores
• Cassandra
• Kafka
• Crunch
• Azkaban
HDFS
• Hadoop Distributed File System
  – High-performance file system for storing data
• We’ve talked about this enough
Hadoop MapReduce
• High-performance, fault-tolerant data processing system
• We’ve also talked about this enough
YARN
• Abstract framework for distributed application development
• Split functionality of JobTracker into two components
  – ResourceManager
  – ApplicationMaster
• TaskTracker becomes NodeManager
  – Containers instead of map and reduce slots
• Configurable amount of memory per NodeManager
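As a rough illustration of containers replacing fixed map and reduce slots, the sketch below does the arithmetic for how many equally sized containers fit into the memory one NodeManager offers. The function name and the memory figures are assumptions made up for this example, not values from the lecture; on a real cluster the per-node amount is set with the `yarn.nodemanager.resource.memory-mb` property.

```python
def containers_per_node(node_memory_mb: int, container_memory_mb: int) -> int:
    """Return how many containers of one size fit in a NodeManager's memory."""
    if node_memory_mb < 0 or container_memory_mb <= 0:
        raise ValueError("memory sizes must be positive")
    # Integer division: a partially fitting container cannot be scheduled.
    return node_memory_mb // container_memory_mb


# Hypothetical node that offers 48 GB to YARN.
node_mb = 48 * 1024

# The same node can host many small containers or a few large ones,
# instead of a fixed number of map slots and reduce slots.
small_containers = containers_per_node(node_mb, 2048)
large_containers = containers_per_node(node_mb, 8192)
print(small_containers, large_containers)
```

The point of the sketch is that the unit of scheduling is memory, so the mix of container sizes can change from job to job without reconfiguring the node.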
MapReduce 2.x on YARN
• MapReduce API has not changed
  – Binary-level backwards compatible (no recompile)
• ApplicationMaster launches and monitors job via YARN
• MapReduce History Server to store… history
• Enabled Yahoo! to scale beyond 4,000 nodes
Hadoop Ecosystem
• Core Technologies
  – Hadoop Distributed File System
  – Hadoop MapReduce
• Many other tools…
  – Which we will be discussing… now
Apache Sqoop
• Apache project designed for efficient transfer between Apache Hadoop and structured data stores
• Used through a CLI; extendable
• Use cases?
Apache Flume
• Distributed, reliable, available service for collecting, aggregating, and moving large amounts of log data
• Configure agents using simple files, extendable
• Use cases?
Apache NiFi
• A service to reliably move and manipulate files between clusters using a web front-end
• Uses a GUI to drop processors and connect them to build workflows
• Use cases?
Apache Pig
• Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
• Infrastructure compiles language to a sequence of MapReduce programs
• Use cases?
Apache Hive
• Data warehouse facilitating querying and managing large data sets
• Compiles SQL-like queries into MapReduce programs
• Use cases?
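To make "compiles SQL-like queries into MapReduce programs" more concrete, here is a minimal pure-Python sketch of how an aggregate such as `SELECT region, SUM(amount) FROM sales GROUP BY region` can be expressed as a map phase, a shuffle, and a reduce phase. The table, the column names, and the helper functions are hypothetical, and this is not how Hive's planner is implemented; it only shows the shape of the job such a query turns into.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(rows: Iterable[Dict[str, object]]) -> Iterator[Tuple[str, int]]:
    """Emit (group key, value) pairs: the GROUP BY column and the summed column."""
    for row in rows:
        yield row["region"], row["amount"]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Collect all values for the same key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Apply the aggregate (SUM) to each group's values."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical "sales" table with made-up rows.
sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 7},
    {"region": "east", "amount": 5},
]

totals = reduce_phase(shuffle(map_phase(sales)))
print(totals)
```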
Hadoop Streaming
• Utility to create and run MapReduce jobs with any executable or script as the mapper or reducer
• Just a jar file, not a real project
• Use cases?
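A word count is the usual first Streaming job, so here is a minimal sketch of one script that can act as either the mapper or the reducer. Streaming passes records to the script on standard input and reads tab-separated key/value lines from its standard output; the reducer receives its input sorted by key. The function names and the `map`/`reduce` command-line argument are choices made for this example, and the sketch assumes every reducer input line contains a tab.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts of consecutive lines that share a key.

    Relies on the input being sorted by key, which the framework
    guarantees between the map and reduce phases.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Only act as a filter when explicitly asked to: "map" or "reduce".
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Assuming the script is saved as `wc.py` (a hypothetical name), the pipeline can be tried locally with `cat input.txt | python wc.py map | sort | python wc.py reduce` before handing the same commands to the streaming jar's `-mapper` and `-reducer` options.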
Apache HBase
• Distributed, scalable, big data store
• Data stored as sorted key/value pairs, with the key consisting of a row and column
• Use cases?
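The "sorted key/value pairs" model can be sketched in a few lines of Python. The class below keeps cells ordered by a (row, column) key so that reading one row or scanning a row range is a binary search followed by a sequential read. It is a toy, single-process illustration with invented names; real HBase keys also carry a column family and a timestamp, which are left out here.

```python
import bisect
from typing import Dict, Iterator, List, Tuple

Key = Tuple[str, str]  # (row, column)


class SortedCellStore:
    """Toy in-memory store that keeps cells sorted by (row, column)."""

    def __init__(self) -> None:
        self._keys: List[Key] = []
        self._values: List[str] = []

    def put(self, row: str, column: str, value: str) -> None:
        """Insert a cell at its sorted position, or overwrite an existing one."""
        key = (row, column)
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def get_row(self, row: str) -> Dict[str, str]:
        """Return all columns of one row; they are adjacent in sort order."""
        return {key[1]: value for key, value in self.scan(row, row + "\0")}

    def scan(self, start_row: str, stop_row: str) -> Iterator[Tuple[Key, str]]:
        """Yield cells with start_row <= row < stop_row, in key order."""
        # (start_row, "") sorts before every real column of start_row.
        i = bisect.bisect_left(self._keys, (start_row, ""))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            yield self._keys[i], self._values[i]
            i += 1


store = SortedCellStore()
store.put("user2", "name", "Grace")
store.put("user1", "name", "Ada")
store.put("user1", "email", "ada@example.com")
store.put("user3", "name", "Edsger")

print(store.get_row("user1"))
print([key for key, _ in store.scan("user1", "user3")])
```

Because the keys are sorted, choosing the row key decides which records sit next to each other, which is why row key design matters so much for scan-heavy use cases.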
Apache Accumulo
• Robust, scalable, high-performance data storage and retrieval key/value store
• Cell-based access controls
  – i.e. cell-level security
• Use cases?
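Cell-level security means each cell carries its own visibility requirement, and a reader only sees the cells their authorizations satisfy. The sketch below is a deliberately simplified Python model of that idea: every cell lists labels that must all be held by the reader. The `Cell` class, the labels, and the sample data are invented for illustration; Accumulo's actual visibility expressions are richer, since they can combine labels with both `&` and `|`.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    # Simplification: the reader must hold ALL of these labels.
    required_labels: FrozenSet[str] = frozenset()


def visible_cells(cells: Iterable[Cell], authorizations: Iterable[str]) -> List[Cell]:
    """Return only the cells whose required labels are a subset of the reader's."""
    auths = frozenset(authorizations)
    return [cell for cell in cells if cell.required_labels <= auths]


# Hypothetical patient record with per-cell labels.
record = [
    Cell("patient1", "name", "J. Doe", frozenset({"staff"})),
    Cell("patient1", "diagnosis", "flu", frozenset({"staff", "medical"})),
    Cell("patient1", "room", "12B"),  # no labels: visible to everyone
]

# Two readers scan the same row and get different views of it.
receptionist_view = [c.column for c in visible_cells(record, {"staff"})]
doctor_view = [c.column for c in visible_cells(record, {"staff", "medical"})]
print(receptionist_view, doctor_view)
```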
Apache Mahout
• Machine learning library to build scalable machine learning algorithms implemented on top of Hadoop MapReduce
• Use cases?
Apache Storm
• Distributed real-time computation system
• Didn’t have a logo until June 2014
• How is this different than MapReduce?
• Use cases?
Apache ZooKeeper
• Effort to develop and maintain an open-source server enabling highly reliable distributed coordination
• Use cases?
Apache Spark
• Fast and general engine for large-scale data processing
• Write applications in Java, Scala, or Python
• Use cases?
SQL on Hadoop
• Apache Drill, Cloudera Impala, Facebook’s Presto, Hortonworks’s Hive Stinger, Pivotal HAWQ, etc.
• SQL-like or ANSI SQL compliant MPP execution engines using HDFS as a data store
• Use cases? Non-use cases?
Sample Architecture
[Diagram. Labeled components: Website, Sales, Call Center, SQL (twice), Webserver, three Flume Agents, HDFS, MapReduce, Pig, HBase, Storm, Oozie]
Apache Cassandra
• NoSQL database for managing large amounts of structured, semi-structured, and unstructured data
• Support for clusters spanning multiple datacenters
• Unlike HBase and Accumulo, data is not stored on HDFS
• Use cases? Non-use cases?
Apache Crunch
• Java framework for writing, testing, and running MapReduce pipelines with a simple API
• Same code executes as a local job, as a MapReduce job, or as a streaming Spark job
• Use cases? *
* Not the real logo, but truly fantastic
Review
• A lot of projects are available to you for your group project
• Think of a problem you are interested in, then choose the appropriate projects to solve it
• Keep in mind data ingest, storage, processing, and egress
• Feel free to explore and use projects other than the ones I have listed here
  – Get permission if you plan on using one as part of your project quota