
Learning Hadoop 2


Table of Contents

Learning Hadoop 2
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Introduction
A note on versioning
The background of Hadoop
Components of Hadoop
Common building blocks
Storage
Computation
Better together
Hadoop 2 – what's the big deal?
Storage in Hadoop 2


Computation in Hadoop 2
Distributions of Apache Hadoop
A dual approach
AWS – infrastructure on demand from Amazon
Simple Storage Service (S3)
Elastic MapReduce (EMR)
Getting started
Cloudera QuickStart VM
Amazon EMR
Creating an AWS account
Signing up for the necessary services
Using Elastic MapReduce
Getting Hadoop up and running
How to use EMR
AWS credentials
The AWS command-line interface
Running the examples
Data processing with Hadoop
Why Twitter?
Building our first dataset
One service, multiple APIs
Anatomy of a Tweet
Twitter credentials
Programmatic access with Python
Summary
2. Storage
The inner workings of HDFS
Cluster startup
NameNode startup
DataNode startup
Block replication


Command-line access to the HDFS filesystem
Exploring the HDFS filesystem
Protecting the filesystem metadata
Secondary NameNode not to the rescue
Hadoop 2 NameNode HA
Keeping the HA NameNodes in sync
Client configuration
How a failover works
Apache ZooKeeper – a different type of filesystem
Implementing a distributed lock with sequential ZNodes
Implementing group membership and leader election using ephemeral ZNodes
Java API
Building blocks
Further reading
Automatic NameNode failover
HDFS snapshots
Hadoop filesystems
Hadoop interfaces
Java FileSystem API
Libhdfs
Thrift
Managing and serializing data
The Writable interface
Introducing the wrapper classes
Array wrapper classes
The Comparable and WritableComparable interfaces
Storing data
Serialization and Containers
Compression
General-purpose file formats
Column-oriented data formats


RCFile
ORC
Parquet
Avro
Using the Java API
Summary
3. Processing – MapReduce and Beyond
MapReduce
Java API to MapReduce
The Mapper class
The Reducer class
The Driver class
Combiner
Partitioning
The optional partition function
Hadoop-provided mapper and reducer implementations
Sharing reference data
Writing MapReduce programs
Getting started
Running the examples
Local cluster
Elastic MapReduce
WordCount, the Hello World of MapReduce
Word co-occurrences
Trending topics
The Top N pattern
Sentiment of hashtags
Text cleanup using chain mapper
Walking through a run of a MapReduce job
Startup
Splitting the input


Task assignment
Task startup
Ongoing JobTracker monitoring
Mapper input
Mapper execution
Mapper output and reducer input
Reducer input
Reducer execution
Reducer output
Shutdown
Input/Output
InputFormat and RecordReader
Hadoop-provided InputFormat
Hadoop-provided RecordReader
OutputFormat and RecordWriter
Hadoop-provided OutputFormat
Sequence files
YARN
YARN architecture
The components of YARN
Anatomy of a YARN application
Lifecycle of a YARN application
Fault tolerance and monitoring
Thinking in layers
Execution models
YARN in the real world – Computation beyond MapReduce
The problem with MapReduce
Tez
Hive-on-tez
Apache Spark
Apache Samza


YARN-independent frameworks
YARN today and beyond
Summary
4. Real-time Computation with Samza
Stream processing with Samza
How Samza works
Samza high-level architecture
Samza's best friend – Apache Kafka
YARN integration
An independent model
Hello Samza!
Building a tweet parsing job
The configuration file
Getting Twitter data into Kafka
Running a Samza job
Samza and HDFS
Windowing functions
Multijob workflows
Tweet sentiment analysis
Bootstrap streams
Stateful tasks
Summary
5. Iterative Computation with Spark
Apache Spark
Cluster computing with working sets
Resilient Distributed Datasets (RDDs)
Actions
Deployment
Spark on YARN
Spark on EC2
Getting started with Spark


Writing and running standalone applications
Scala API
Java API
WordCount in Java
Python API
The Spark ecosystem
Spark Streaming
GraphX
MLlib
Spark SQL
Processing data with Apache Spark
Building and running the examples
Running the examples on YARN
Finding popular topics
Assigning a sentiment to topics
Data processing on streams
State management
Data analysis with Spark SQL
SQL on data streams
Comparing Samza and Spark Streaming
Summary
6. Data Analysis with Apache Pig
An overview of Pig
Getting started
Running Pig
Grunt – the Pig interactive shell
Elastic MapReduce
Fundamentals of Apache Pig
Programming Pig
Pig data types
Pig functions


Load/store
Eval
The tuple, bag, and map functions
The math, string, and datetime functions
Dynamic invokers
Macros
Working with data
Filtering
Aggregation
Foreach
Join
Extending Pig (UDFs)
Contributed UDFs
Piggybank
Elephant Bird
Apache DataFu
Analyzing the Twitter stream
Prerequisites
Dataset exploration
Tweet metadata
Data preparation
Top n statistics
Datetime manipulation
Sessions
Capturing user interactions
Link analysis
Influential users
Summary
7. Hadoop and SQL
Why SQL on Hadoop
Other SQL-on-Hadoop solutions


Prerequisites
Overview of Hive
The nature of Hive tables
Hive architecture
Data types
DDL statements
File formats and storage
JSON
Avro
Columnar stores
Queries
Structuring Hive tables for given workloads
Partitioning a table
Overwriting and updating data
Bucketing and sorting
Sampling data
Writing scripts
Hive and Amazon Web Services
Hive and S3
Hive on Elastic MapReduce
Extending HiveQL
Programmatic interfaces
JDBC
Thrift
Stinger initiative
Impala
The architecture of Impala
Co-existing with Hive
A different philosophy
Drill, Tajo, and beyond
Summary


8. Data Lifecycle Management
What data lifecycle management is
Importance of data lifecycle management
Tools to help
Building a tweet analysis capability
Getting the tweet data
Introducing Oozie
A note on HDFS file permissions
Making development a little easier
Extracting data and ingesting into Hive
A note on workflow directory structure
Introducing HCatalog
Using HCatalog
The Oozie sharelib
HCatalog and partitioned tables
Producing derived data
Performing multiple actions in parallel
Calling a subworkflow
Adding global settings
Challenges of external data
Data validation
Validation actions
Handling format changes
Handling schema evolution with Avro
Final thoughts on using Avro schema evolution
Only make additive changes
Manage schema versions explicitly
Think about schema distribution
Collecting additional data
Scheduling workflows
Other Oozie triggers


Pulling it all together
Other tools to help
Summary
9. Making Development Easier
Choosing a framework
Hadoop streaming
Streaming word count in Python
Differences in jobs when using streaming
Finding important words in text
Calculate term frequency
Calculate document frequency
Putting it all together – TF-IDF
Kite Data
Data Core
Data HCatalog
Data Hive
Data MapReduce
Data Spark
Data Crunch
Apache Crunch
Getting started
Concepts
Data serialization
Data processing patterns
Aggregation and sorting
Joining data
Pipelines implementation and execution
SparkPipeline
MemPipeline
Crunch examples
Word co-occurrence


TF-IDF
Kite Morphlines
Concepts
Morphline commands
Summary
10. Running a Hadoop Cluster
I'm a developer – I don't care about operations!
Hadoop and DevOps practices
Cloudera Manager
To pay or not to pay
Cluster management using Cloudera Manager
Cloudera Manager and other management tools
Monitoring with Cloudera Manager
Finding configuration files
Cloudera Manager API
Cloudera Manager lock-in
Ambari – the open source alternative
Operations in the Hadoop 2 world
Sharing resources
Building a physical cluster
Physical layout
Rack awareness
Service layout
Upgrading a service
Building a cluster on EMR
Considerations about filesystems
Getting data into EMR
EC2 instances and tuning
Cluster tuning
JVM considerations
The small files problem


Map and reduce optimizations
Security
Evolution of the Hadoop security model
Beyond basic authorization
The future of Hadoop security
Consequences of using a secured cluster
Monitoring
Hadoop – where failures don't matter
Monitoring integration
Application-level metrics
Troubleshooting
Logging levels
Access to log files
ResourceManager, NodeManager, and ApplicationManager
Applications
Nodes
Scheduler
MapReduce
MapReduce v1
MapReduce v2 (YARN)
JobHistory Server
NameNode and DataNode
Summary
11. Where to Go Next
Alternative distributions
Cloudera Distribution for Hadoop
Hortonworks Data Platform
MapR
And the rest…
Choosing a distribution
Other computational frameworks


Apache Storm
Apache Giraph
Apache HAMA
Other interesting projects
HBase
Sqoop
Whirr
Mahout
Hue
Other programming abstractions
Cascading
AWS resources
SimpleDB and DynamoDB
Kinesis
Data Pipeline
Sources of information
Source code
Mailing lists and forums
LinkedIn groups
HUGs
Conferences
Summary
Index


Learning Hadoop 2


Learning Hadoop 2

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing and its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2015

Production reference: 1060215

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78328-551-8

www.packtpub.com


Credits

Authors
Garry Turkington
Gabriele Modena

Reviewers
Atdhe Buja
Amit Gurdasani
Jakob Homan
James Lampton
Davide Setti
Valerie Parham-Thompson

Commissioning Editor
Edward Gordon

Acquisition Editor
Joanne Fitzpatrick

Content Development Editor
Vaibhav Pawar

Technical Editors
Indrajit A. Das
Menza Mathew

Copy Editors
Roshni Banerjee
Sarang Chari
Pranjali Chury

Project Coordinator
Kranti Berde

Proofreaders
Simran Bhogal
Martin Diver
Lawrence A. Herman
Paul Hindle

Indexer
Hemangini Bari

Graphics
Abhinash Sahu

Production Coordinator
Nitesh Thakur

Cover Work
Nitesh Thakur


About the Authors

Garry Turkington has over 15 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current role as the CTO at Improve Digital, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital, he spent time at Amazon.co.uk, where he led several software development teams, building systems that process the Amazon catalog data for every item worldwide. Prior to this, he spent a decade in various government positions in both the UK and the USA.

He has BSc and PhD degrees in Computer Science from Queen's University Belfast in Northern Ireland, and a Master's degree in Systems Engineering from Stevens Institute of Technology in the USA. He is the author of Hadoop Beginner's Guide, published by Packt Publishing in 2013, and is a committer on the Apache Samza project.

I would like to thank my wife Lea and mother Sarah for their support and patience through the writing of another book, and my daughter Maya for frequently cheering me up and asking me hard questions. I would also like to thank Gabriele for being such an amazing co-author on this project.

Gabriele Modena is a data scientist at Improve Digital. In his current position, he uses Hadoop to manage, process, and analyze behavioral and machine-generated data. Gabriele enjoys using statistical and computational methods to look for patterns in large amounts of data. Prior to his current job in ad tech, he held a number of positions in academia and industry, where he did research in machine learning and artificial intelligence.

He holds a BSc degree in Computer Science from the University of Trento, Italy, and a Research MSc degree in Artificial Intelligence: Learning Systems from the University of Amsterdam in the Netherlands.

First and foremost, I want to thank Laura for her support, constant encouragement, and endless patience in putting up with far too many instances of "can't do, I'm working on the Hadoop book". She is my rock and I dedicate this book to her.

A special thank you goes to Amit, Atdhe, Davide, Jakob, James, and Valerie, whose invaluable feedback and commentary made this work possible.

Finally, I'd like to thank my co-author, Garry, for bringing me on board with this project; it has been a pleasure working together.


About the Reviewers

Atdhe Buja is a certified ethical hacker, DBA (MCITP, OCA 11g), and developer with good management skills. He is a DBA at the Agency for Information Society / Ministry of Public Administration, where he also manages some projects of e-governance, and he has more than 10 years' experience working on SQL Server.

Atdhe is a regular columnist for UBT News. Currently, he holds an MSc degree in computer science and engineering and has a bachelor's degree in management and information. He specializes in and is certified in many technologies, such as SQL Server (all versions), Oracle 11g, CEH, Windows Server, MS Project, SCOM 2012 R2, BizTalk, and the integration of business processes.

He was the reviewer of the book Microsoft SQL Server 2012 with Hadoop, published by Packt Publishing. His capabilities go beyond the aforementioned knowledge!

I thank Donika and my family for all the encouragement and support.

Amit Gurdasani is a software engineer at Amazon. He architects distributed systems to process product catalogue data. Prior to building high-throughput systems at Amazon, he was working on the entire software stack, both as a systems-level developer at Ericsson and IBM as well as an application developer at Manhattan Associates. He maintains a strong interest in bulk data processing, data streaming, and service-oriented software architectures.

Jakob Homan has been involved with big data and the Apache Hadoop ecosystem for more than 5 years. He is a Hadoop committer as well as a committer for the Apache Giraph, Spark, Kafka, and Tajo projects, and is a PMC member. He has worked on bringing all these systems to scale at Yahoo! and LinkedIn.

James Lampton is a seasoned practitioner of all things data (big or small) with 10 years of hands-on experience in building and using large-scale data storage and processing platforms. He is a believer in holistic approaches to solving problems using the right tool for the right job. His favorite tools include Python, Java, Hadoop, Pig, Storm, and SQL (which sometimes I like and sometimes I don't). He has recently completed his PhD at the University of Maryland with the release of Pig Squeal: a mechanism for running Pig scripts on Storm.

I would like to thank my spouse, Andrea, and my son, Henry, for giving me time to read work-related things at home. I would also like to thank Garry, Gabriele, and the folks at Packt Publishing for the opportunity to review this manuscript and for their patience and understanding, as my free time was consumed when writing my dissertation.

Davide Setti, after graduating in physics from the University of Trento, joined the SoNet research unit at the Fondazione Bruno Kessler in Trento, where he applied large-scale data analysis techniques to understand people's behaviors in social networks and large collaborative projects such as Wikipedia.

In 2010, Davide moved to Fondazione, where he led the development of data analytic tools to support research on civic media, citizen journalism, and digital media.

In 2013, Davide became the CTO of SpazioDati, where he leads the development of tools to perform semantic analysis of massive amounts of data in the business information sector.

When not solving hard problems, Davide enjoys taking care of his family vineyard and playing with his two children.


www.PacktPub.com


Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.


Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser


Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.


Preface

This book will take you on a hands-on exploration of the wonderful world that is Hadoop 2 and its rapidly growing ecosystem. Building on the solid foundation from the earlier versions of the platform, Hadoop 2 allows multiple data processing frameworks to be executed on a single Hadoop cluster.

To give an understanding of this significant evolution, we will explore both how these new models work and also show their applications in processing large data volumes with batch, iterative, and near-real-time algorithms.


What this book covers

Chapter 1, Introduction, gives the background to Hadoop and the Big Data problems it looks to solve. We also highlight the areas in which Hadoop 1 had room for improvement.

Chapter 2, Storage, delves into the Hadoop Distributed File System, where most data processed by Hadoop is stored. We examine the particular characteristics of HDFS, show how to use it, and discuss how it has improved in Hadoop 2. We also introduce ZooKeeper, another storage system within Hadoop, upon which many of its high-availability features rely.

Chapter 3, Processing – MapReduce and Beyond, first discusses the traditional Hadoop processing model and how it is used. We then discuss how Hadoop 2 has generalized the platform to use multiple computational models, of which MapReduce is merely one.

Chapter 4, Real-time Computation with Samza, takes a deeper look at one of these alternative processing models enabled by Hadoop 2. In particular, we look at how to process real-time streaming data with Apache Samza.

Chapter 5, Iterative Computation with Spark, delves into a very different alternative processing model. In this chapter, we look at how Apache Spark provides the means to do iterative processing.

Chapter 6, Data Analysis with Pig, demonstrates how Apache Pig makes the traditional computational model of MapReduce easier to use by providing a language to describe data flows.

Chapter 7, Hadoop and SQL, looks at how the familiar SQL language has been implemented atop data stored in Hadoop. Through the use of Apache Hive and by describing alternatives such as Cloudera Impala, we show how Big Data processing can be made possible using existing skills and tools.

Chapter 8, Data Lifecycle Management, takes a look at the bigger picture of just how to manage all that data that is to be processed in Hadoop. Using Apache Oozie, we show how to build up workflows to ingest, process, and manage data.

Chapter 9, Making Development Easier, focuses on a selection of tools aimed at helping a developer get results quickly. Through the use of Hadoop streaming, Apache Crunch, and Kite, we show how the use of the right tool can speed up the development loop or provide new APIs with richer semantics and less boilerplate.

Chapter 10, Running a Hadoop Cluster, takes a look at the operational side of Hadoop. By focusing on the areas of interest to developers, such as cluster management, monitoring, and security, this chapter should help you to work better with your operations staff.

Chapter 11, Where to Go Next, takes you on a whirlwind tour through a number of other projects and tools that we feel are useful, but could not cover in detail in the book due to space constraints. We also give some pointers on where to find additional sources of information and how to engage with the various open source communities.


What you need for this book

Because most people don't have a large number of spare machines sitting around, we use the Cloudera QuickStart virtual machine for most of the examples in this book. This is a single machine image with all the components of a full Hadoop cluster pre-installed. It can be run on any host machine supporting either the VMware or the VirtualBox virtualization technology.

We also explore Amazon Web Services and how some of the Hadoop technologies can be run on the AWS Elastic MapReduce service. The AWS services can be managed through a web browser or a Linux command-line interface.


Who this book is for

This book is primarily aimed at application and system developers interested in learning how to solve practical problems using the Hadoop framework and related components. Although we show examples in a few programming languages, a strong foundation in Java is the main prerequisite.

Data engineers and architects might also find the material concerning data lifecycle, file formats, and computational models useful.


Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce .jar file to our environment before accessing individual fields."

A block of code is set as follows:

topic_edges_grouped = FOREACH topic_edges_grouped {
  GENERATE
    group.topic_id as topic,
    group.source_id as source,
    topic_edges.(destination_id, w) as edges;
}

Any command-line input or output is written as follows:

$ hdfs dfs -put target/elephant-bird-pig-4.5.jar hdfs:///jar/
$ hdfs dfs -put target/elephant-bird-hadoop-compat-4.5.jar hdfs:///jar/
$ hdfs dfs -put elephant-bird-core-4.5.jar hdfs:///jar/

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes, appear in the text like this: "Once the form is filled in, we need to review and accept the terms of service and click on the Create Application button in the bottom-left corner of the page."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.


Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.


Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.


Downloading the example code

The source code for this book can be found on GitHub at https://github.com/learninghadoop2/book-examples. The authors will be applying any errata to this code and keeping it up to date as the technologies evolve. In addition, you can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.


Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.


Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.


Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.


Chapter 1. Introduction

This book will teach you how to build amazing systems using the latest release of Hadoop. Before you change the world though, we need to do some groundwork, which is where this chapter comes in.

In this introductory chapter, we will cover the following topics:

A brief refresher on the background to Hadoop
A walk-through of Hadoop's evolution
The key elements in Hadoop 2
The Hadoop distributions we'll use in this book
The dataset we'll use for examples


A note on versioning

In Hadoop 1, the version history was somewhat convoluted, with multiple forked branches in the 0.2x range leading to odd situations where a 1.x version could, in some cases, have fewer features than a 0.23 release. In the version 2 codebase, this is fortunately much more straightforward, but it's important to clarify exactly which version we will use in this book.

Hadoop 2.0 was released in alpha and beta versions, and along the way, several incompatible changes were introduced. There was, in particular, a major API stabilization effort between the beta and final release stages.

Hadoop 2.2.0 was the first general availability (GA) release of the Hadoop 2 codebase, and its interfaces are now declared stable and forward compatible. We will therefore use the 2.2 product and interfaces in this book. Though the principles will be usable on a 2.0 beta, there will, in particular, be API incompatibilities in the beta. This is particularly important as MapReduce v2 was back-ported to Hadoop 1 by several distribution vendors, but these products were based on the beta and not the GA APIs. If you are using such a product, then you will encounter these incompatible changes. It is recommended that a release based upon Hadoop 2.2 or later is used for both the development and the production deployments of any Hadoop 2 workloads.


The background of Hadoop

We're assuming that most readers will have a little familiarity with Hadoop, or at the very least, with big data-processing systems. Consequently, we won't give a detailed background in this book as to why Hadoop is successful or the types of problem it helps to solve. However, particularly because of some aspects of Hadoop 2 and the other products we will use in later chapters, it is useful to give a sketch of how we see Hadoop fitting into the technology landscape and which are the particular problem areas where we believe it gives the most benefit.

In ancient times, before the term "big data" came into the picture (which equates to maybe a decade ago), there were few options to process datasets of sizes in terabytes and beyond. Some commercial databases could, with very specific and expensive hardware setups, be scaled to this level, but the expertise and capital expenditure required made it an option for only the largest organizations. Alternatively, one could build a custom system aimed at the specific problem at hand. This suffered from some of the same problems (expertise and cost) and added the risk inherent in any cutting-edge system. On the other hand, if a system was successfully constructed, it was likely a very good fit to the need.

Few small- to mid-size companies even worried about this space, not only because the solutions were out of their reach, but also because they generally didn't have anything close to the data volumes that required such solutions. As the ability to generate very large datasets became more common, so did the need to process that data.

Even though large data became more democratized and was no longer the domain of the privileged few, major architectural changes were required if data-processing systems were to be made affordable to smaller companies. The first big change was to reduce the required upfront capital expenditure on the system; that means no high-end hardware or expensive software licenses. Previously, high-end hardware would have been utilized most commonly in a relatively small number of very large servers and storage systems, each of which had multiple approaches to avoid hardware failures. Though very impressive, such systems are hugely expensive, and moving to a larger number of lower-end servers would be the quickest way to dramatically reduce the hardware cost of a new system. Moving more toward commodity hardware instead of the traditional enterprise-grade equipment would also mean a reduction in capabilities in the area of resilience and fault tolerance. Those responsibilities would need to be taken up by the software layer. Smarter software, dumber hardware.

Google started the change that would eventually be known as Hadoop when, in 2003 and 2004, it released two academic papers describing the Google File System (GFS) (http://research.google.com/archive/gfs.html) and MapReduce (http://research.google.com/archive/mapreduce.html). The two together provided a platform for very large-scale data processing in a highly efficient manner. Google had taken the build-it-yourself approach, but instead of constructing something aimed at one specific problem or dataset, they created a platform on which multiple processing applications could be implemented. In particular, they utilized large numbers of commodity servers and built GFS and MapReduce in a way that assumed hardware failures would be commonplace and were simply something that the software needed to deal with.

At the same time, Doug Cutting was working on the Nutch open source web crawler. He was working on elements within the system that resonated strongly once the Google GFS and MapReduce papers were published. Doug started work on open source implementations of these Google ideas, and Hadoop was soon born, firstly as a subproject of Lucene, and then as its own top-level project within the Apache Software Foundation.

Yahoo! hired Doug Cutting in 2006 and quickly became one of the most prominent supporters of the Hadoop project. In addition to often publicizing some of the largest Hadoop deployments in the world, Yahoo! allowed Doug and other engineers to contribute to Hadoop while employed by the company, not to mention contributing back some of its own internally developed Hadoop improvements and extensions.


Components of Hadoop

The broad Hadoop umbrella project has many component subprojects, and we'll discuss several of them in this book. At its core, Hadoop provides two services: storage and computation. A typical Hadoop workflow consists of loading data into the Hadoop Distributed File System (HDFS) and processing it using the MapReduce API or one of the several tools that rely on MapReduce as an execution framework.

Hadoop 1: HDFS and MapReduce

Both layers are direct implementations of Google's own GFS and MapReduce technologies.


Common building blocks

Both HDFS and MapReduce exhibit several of the architectural principles described in the previous section. In particular, the common principles are as follows:

Both are designed to run on clusters of commodity (that is, low to medium specification) servers
Both scale their capacity by adding more servers (scale-out), as opposed to the previous models of using larger hardware (scale-up)
Both have mechanisms to identify and work around failures
Both provide most of their services transparently, allowing the user to concentrate on the problem at hand
Both have an architecture where a software cluster sits on the physical servers and manages aspects such as application load balancing and fault tolerance, without relying on high-end hardware to deliver these capabilities


Storage

HDFS is a filesystem, though not a POSIX-compliant one. This basically means that it does not display the same characteristics as that of a regular filesystem. In particular, the characteristics are as follows:

HDFS stores files in blocks that are typically at least 64 MB or (more commonly now) 128 MB in size, much larger than the 4-32 KB seen in most filesystems
HDFS is optimized for throughput over latency; it is very efficient at streaming reads of large files but poor when seeking for many small ones
HDFS is optimized for workloads that are generally write-once and read-many
Instead of handling disk failures by having physical redundancies in disk arrays or similar strategies, HDFS uses replication. Each of the blocks comprising a file is stored on multiple nodes within the cluster, and a service called the NameNode constantly monitors to ensure that failures have not dropped any block below the desired replication factor. If this does happen, then it schedules the making of another copy within the cluster.
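To make these characteristics concrete, here is a minimal sketch (not taken from the book's example code) that uses Hadoop's Java FileSystem API to copy a local file into HDFS and then report the block size and replication factor it was stored with; the file and directory paths are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsInspect {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml/hdfs-site.xml from the classpath, so
        // fs.defaultFS should already point at your cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS (paths are hypothetical).
        Path local = new Path("/tmp/tweets.json");
        Path remote = new Path("/user/cloudera/tweets.json");
        fs.copyFromLocalFile(local, remote);

        // Each file records the block size and replication factor it was written with.
        FileStatus status = fs.getFileStatus(remote);
        System.out.printf("Block size: %d MB, replication: %d%n",
                status.getBlockSize() / (1024 * 1024), status.getReplication());
    }
}

The same client code works whether the configured filesystem is a single-node QuickStart VM or a large production cluster; only the configuration differs.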


Computation

MapReduce is an API, an execution engine, and a processing paradigm; it provides a series of transformations from a source into a result dataset. In the simplest case, the input data is fed through a map function and the resultant temporary data is then fed through a reduce function.

MapReduce works best on semistructured or unstructured data. Instead of data conforming to rigid schemas, the requirement is instead that the data can be provided to the map function as a series of key-value pairs. The output of the map function is a set of other key-value pairs, and the reduce function performs aggregation to collect the final set of results.

Hadoop provides a standard specification (that is, interface) for the map and reduce phases, and the implementations of these are often referred to as mappers and reducers. A typical MapReduce application will comprise a number of mappers and reducers, and it's not unusual for several of these to be extremely simple. The developer focuses on expressing the transformation between the source and the resultant data, and the Hadoop framework manages all aspects of job execution and coordination.
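As an illustration of this split of responsibilities, the following is a minimal word-count-style mapper and reducer sketch against the Hadoop 2 Java API; the class and field names are our own invention, and WordCount itself is covered properly in Chapter 3:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The mapper turns each input line into (word, 1) key-value pairs.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}

// The reducer aggregates all the counts emitted for the same word.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Note how neither class says anything about where the data lives or how many tasks will run; that is exactly the part the framework takes care of.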


Better together

It is possible to appreciate the individual merits of HDFS and MapReduce, but they are even more powerful when combined. They can be used individually, but when they are together, they bring out the best in each other, and this close interworking was a major factor in the success and acceptance of Hadoop 1.

When a MapReduce job is being planned, Hadoop needs to decide on which host to execute the code in order to process the dataset most efficiently. If the MapReduce cluster hosts are all pulling their data from a single storage host or array, then this largely doesn't matter as the storage system is a shared resource that will cause contention. If the storage system were more transparent and allowed MapReduce to manipulate its data more directly, then there would be an opportunity to perform the processing closer to the data, building on the principle of it being less expensive to move processing than data.

The most common deployment model for Hadoop sees the HDFS and MapReduce clusters deployed on the same set of servers. Each host that contains data and the HDFS component to manage the data also hosts a MapReduce component that can schedule and execute data processing. When a job is submitted to Hadoop, it can use the locality optimization to schedule processing on the hosts where the data resides as much as possible, thus minimizing network traffic and maximizing performance.


Hadoop 2 – what's the big deal?

If we look at the two main components of the core Hadoop distribution, storage and computation, we see that Hadoop 2 has a very different impact on each of them. Whereas the HDFS found in Hadoop 2 is mostly a much more feature-rich and resilient product than the HDFS in Hadoop 1, for MapReduce the changes are much more profound and have, in fact, altered how Hadoop is perceived as a processing platform in general. Let's look at HDFS in Hadoop 2 first.


Storage in Hadoop 2

We'll discuss the HDFS architecture in more detail in Chapter 2, Storage, but for now, it's sufficient to think of a master-slave model. The slave nodes (called DataNodes) hold the actual filesystem data. In particular, each host running a DataNode will typically have one or more disks onto which files containing the data for each HDFS block are written. The DataNode itself has no understanding of the overall filesystem; its role is to store, serve, and ensure the integrity of the data for which it is responsible.

The master node (called the NameNode) is responsible for knowing which of the DataNodes holds which block and how these blocks are structured to form the filesystem. When a client looks at the filesystem and wishes to retrieve a file, it's via a request to the NameNode that the list of required blocks is retrieved.

This model works well and has been scaled to clusters with tens of thousands of nodes at companies such as Yahoo! So, though it is scalable, there is a resiliency risk; if the NameNode becomes unavailable, then the entire cluster is rendered effectively useless. No HDFS operations can be performed, and since the vast majority of installations use HDFS as the storage layer for services such as MapReduce, they also become unavailable even if they are still running without problems.

More catastrophically, the NameNode stores the filesystem metadata to a persistent file on its local filesystem. If the NameNode host crashes in a way that this data is not recoverable, then all data on the cluster is effectively lost forever. The data will still exist on the various DataNodes, but the mapping of which blocks comprise which files is lost. This is why, in Hadoop 1, the best practice was to have the NameNode synchronously write its filesystem metadata to both local disks and at least one remote network volume (typically via NFS).

Several NameNode high-availability (HA) solutions have been made available by third-party suppliers, but the core Hadoop product did not offer such resilience in Version 1. Given this architectural single point of failure and the risk of data loss, it won't be a surprise to hear that NameNode HA is one of the major features of HDFS in Hadoop 2 and is something we'll discuss in detail in later chapters. The feature not only provides a standby NameNode that can be automatically promoted to service all requests should the active NameNode fail, but also builds additional resilience for the critical filesystem metadata atop this mechanism.

HDFS in Hadoop 2 is still a non-POSIX filesystem; it still has a very large block size and it still trades latency for throughput. However, it does now have a few capabilities that can make it look a little more like a traditional filesystem. In particular, the core HDFS in Hadoop 2 can now be remotely mounted as an NFS volume. This is another feature that was previously offered as a proprietary capability by third-party suppliers but is now in the main Apache codebase.

Overall, the HDFS in Hadoop 2 is more resilient and can be more easily integrated into existing workflows and processes. It's a strong evolution of the product found in Hadoop 1.


Computation in Hadoop 2

The work on HDFS 2 was started before a direction for MapReduce crystallized. This was likely due to the fact that features such as NameNode HA were such an obvious path that the community knew the most critical areas to address. However, MapReduce didn't really have a similar list of areas of improvement, and that's why, when the MRv2 initiative started, it wasn't completely clear where it would lead.

Perhaps the most frequent criticism of MapReduce in Hadoop 1 was how its batch processing model was ill-suited to problem domains where faster response times were required. Hive, for example, which we'll discuss in Chapter 7, Hadoop and SQL, provides a SQL-like interface onto HDFS data, but, behind the scenes, the statements are converted into MapReduce jobs that are then executed like any other. A number of other products and tools took a similar approach, providing a specific user-facing interface that hid a MapReduce translation layer.

Though this approach has been very successful, and some amazing products have been built, the fact remains that in many cases there is a mismatch: all of these interfaces, some of which expect a certain type of responsiveness, are, behind the scenes, being executed on a batch-processing platform. When looking to enhance MapReduce, improvements could be made to make it a better fit to these use cases, but the fundamental mismatch would remain. This situation led to a significant change of focus of the MRv2 initiative; perhaps MapReduce itself didn't need change, but the real need was to enable different processing models on the Hadoop platform. Thus was born Yet Another Resource Negotiator (YARN).

Looking at MapReduce in Hadoop 1, the product actually did two quite different things; it provided the processing framework to execute MapReduce computations, but it also managed the allocation of this computation across the cluster. Not only did it direct data to and between the specific map and reduce tasks, but it also determined where each task would run, and managed the full job lifecycle, monitoring the health of each task and node, rescheduling if any failed, and so on.

This is not a trivial task, and the automated parallelization of workloads has always been one of the main benefits of Hadoop. If we look at MapReduce in Hadoop 1, we see that after the user defines the key criteria for the job, everything else is the responsibility of the system. Critically, from a scale perspective, the same MapReduce job can be applied to datasets of any volume hosted on clusters of any size. If the data is 1 GB in size and on a single host, then Hadoop will schedule the processing accordingly. If the data is instead 1 PB in size and hosted across 1,000 machines, then it does likewise. From the user's perspective, the actual scale of the data and cluster is transparent, and aside from affecting the time taken to process the job, it does not change the interface with which to interact with the system.

In Hadoop 2, this role of job scheduling and resource management is separated from that of executing the actual application, and is implemented by YARN.


YARN is responsible for managing the cluster resources, and so MapReduce exists as an application that runs atop the YARN framework. The MapReduce interface in Hadoop 2 is completely compatible with that in Hadoop 1, both semantically and practically. However, under the covers, MapReduce has become a hosted application on the YARN framework.
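As a consequence, a driver written against the familiar MapReduce Job API needs no changes to run on YARN. The following sketch (reusing the illustrative TokenMapper and SumReducer classes from the earlier example, with hypothetical input and output paths) is submitted in exactly the same way on either platform; whether it executes on classic MapReduce or as a YARN application is decided by the cluster's configuration, not by the code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The execution framework (classic vs. YARN) is selected via
        // cluster configuration such as mapreduce.framework.name.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/cloudera/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/cloudera/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}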

The significance of this split is that other applications can be written that provide processing models more focused on the actual problem domain and can offload all the resource management and scheduling responsibilities to YARN. The latest versions of many different execution engines have been ported onto YARN, either in a production-ready or experimental state, and it has shown that the approach can allow a single Hadoop cluster to run everything from batch-oriented MapReduce jobs through fast-response SQL queries to continuous data streaming, and even to implement models such as graph processing and the Message Passing Interface (MPI) from the High Performance Computing (HPC) world. The following diagram shows the architecture of Hadoop 2:

Hadoop 2

This is why much of the attention and excitement around Hadoop 2 has been focused on YARN and frameworks that sit on top of it, such as Apache Tez and Apache Spark. With YARN, the Hadoop cluster is no longer just a batch-processing engine; it is the single platform on which a vast array of processing techniques can be applied to the enormous data volumes stored in HDFS. Moreover, applications can build on these computation paradigms and execution models.

The analogy that is achieving some traction is to think of YARN as the processing kernel upon which other domain-specific applications can be built. We'll discuss YARN in more detail in this book, particularly in Chapter 3, Processing – MapReduce and Beyond, Chapter 4, Real-time Computation with Samza, and Chapter 5, Iterative Computation with Spark.


Distributions of Apache Hadoop

In the very early days of Hadoop, the burden of installing (often building from source) and managing each component and its dependencies fell on the user. As the system became more popular and the ecosystem of third-party tools and libraries started to grow, the complexity of installing and managing a Hadoop deployment increased dramatically, to the point where providing a coherent offering of software packages, documentation, and training built around the core Apache Hadoop has become a business model. Enter the world of distributions for Apache Hadoop.

Hadoop distributions are conceptually similar to how Linux distributions provide a set of integrated software around a common core. They take the burden of bundling and packaging software themselves and provide the user with an easy way to install, manage, and deploy Apache Hadoop and a selected number of third-party libraries. In particular, the distribution releases deliver a series of product versions that are certified to be mutually compatible. Historically, putting together a Hadoop-based platform was often greatly complicated by the various version interdependencies.

Cloudera (http://www.cloudera.com), Hortonworks (http://www.hortonworks.com), and MapR (http://www.mapr.com) are amongst the first to have reached the market, each characterized by different approaches and selling points. Hortonworks positions itself as the open source player; Cloudera is also committed to open source but adds proprietary bits for configuring and managing Hadoop; MapR provides a hybrid open source/proprietary Hadoop distribution characterized by a proprietary NFS layer instead of HDFS and a focus on providing services.

Another strong player in the distributions ecosystem is Amazon, which offers a version of Hadoop called Elastic MapReduce (EMR) on top of the Amazon Web Services (AWS) infrastructure.

With the advent of Hadoop 2, the number of available distributions for Hadoop has increased dramatically, far in excess of the four we mentioned. A possibly incomplete list of software offerings that includes Apache Hadoop can be found at http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support.


A dual approach

In this book, we will discuss both the building and the management of local Hadoop clusters, in addition to showing how to push the processing into the cloud via EMR.

The reason for this is twofold: firstly, though EMR makes Hadoop much more accessible, there are aspects of the technology that only become apparent when manually administering the cluster. Although it is also possible to use EMR in a more manual mode, we'll generally use a local cluster for such explorations. Secondly, though it isn't necessarily an either/or decision, many organizations use a mixture of in-house and cloud-hosted capacities, sometimes due to a concern about over-reliance on a single external provider; but practically speaking, it's often convenient to do development and small-scale tests on local capacity and then deploy at production scale into the cloud.

In a few of the later chapters, where we discuss additional products that integrate with Hadoop, we'll mostly give examples of local clusters, as there is no difference in how the products work regardless of where they are deployed.


AWS – infrastructure on demand from Amazon

AWS is a set of cloud-computing services offered by Amazon. We will use several of these services in this book.


Simple Storage Service (S3)

Amazon's Simple Storage Service (S3), found at http://aws.amazon.com/s3/, is a storage service that provides a simple key-value storage model. Using web, command-line, or programmatic interfaces to create objects, which can be anything from text files to images to MP3s, you can store and retrieve your data based on a hierarchical model. In this model, you create buckets that contain objects. Each bucket has a unique identifier, and within each bucket, every object is uniquely named. This simple strategy enables an extremely powerful service for which Amazon takes complete responsibility (for service scaling, in addition to reliability and availability of data).
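As a sketch of what the programmatic interface looks like, the following assumes the AWS SDK for Java is on the classpath and that credentials are configured in the usual way; the bucket and object names are invented purely for illustration:

import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3Example {
    public static void main(String[] args) {
        // Credentials and region are resolved by the SDK's default provider
        // chain (environment variables, ~/.aws/credentials, and so on).
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Bucket names are globally unique; object keys are unique within a bucket.
        String bucket = "learninghadoop2-example-bucket";  // hypothetical name

        // Store an object under a key, then read it back.
        s3.putObject(bucket, "tweets/2015-02-01.json", new File("/tmp/tweets.json"));
        String content = s3.getObjectAsString(bucket, "tweets/2015-02-01.json");
        System.out.println(content.length() + " characters retrieved");
    }
}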


Elastic MapReduce (EMR)

Amazon's Elastic MapReduce, found at http://aws.amazon.com/elasticmapreduce/, is basically Hadoop in the cloud. Using any of the multiple interfaces (web console, CLI, or API), a Hadoop workflow is defined with attributes such as the number of Hadoop hosts required and the location of the source data. The Hadoop code implementing the MapReduce jobs is provided, and the virtual Go button is pressed.

In its most impressive mode, EMR can pull source data from S3, process it on a Hadoop cluster it creates on Amazon's on-demand virtual host service, EC2, push the results back into S3, and terminate the Hadoop cluster and the EC2 virtual machines hosting it. Naturally, each of these services has a cost (usually on a per-GB-stored and server-time usage basis), but the ability to access such powerful data-processing capabilities with no need for dedicated hardware is a powerful one.
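To give a flavour of what this looks like programmatically, here is a rough sketch using the EMR client from the AWS SDK for Java to request a small transient cluster that runs a single JAR step and then terminates; the bucket names, JAR, instance types, roles, and release label are all hypothetical, and the exact builder calls will vary with the SDK and EMR versions you use:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.*;

public class EmrExample {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // A single step: run our job JAR against input in S3, writing results back to S3.
        StepConfig wordCount = new StepConfig()
                .withName("wordcount")
                .withActionOnFailure("TERMINATE_CLUSTER")
                .withHadoopJarStep(new HadoopJarStepConfig()
                        .withJar("s3://learninghadoop2-example-bucket/jobs/wordcount.jar")
                        .withArgs("s3://learninghadoop2-example-bucket/input/",
                                  "s3://learninghadoop2-example-bucket/output/"));

        // Request a small, transient cluster that shuts down when the step finishes.
        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("Learning Hadoop 2 example")
                .withReleaseLabel("emr-5.0.0")              // hypothetical EMR release
                .withSteps(wordCount)
                .withLogUri("s3://learninghadoop2-example-bucket/logs/")
                .withServiceRole("EMR_DefaultRole")
                .withJobFlowRole("EMR_EC2_DefaultRole")
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(3)
                        .withMasterInstanceType("m4.large")
                        .withSlaveInstanceType("m4.large")
                        .withKeepJobFlowAliveWhenNoSteps(false));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started cluster " + result.getJobFlowId());
    }
}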


Getting started

We will now describe the two environments we will use throughout the book: Cloudera's QuickStart virtual machine will be our reference system on which we will show all examples, but we will additionally demonstrate some examples on Amazon's EMR when there is some particularly valuable aspect to running the example in the on-demand service.

Although the examples and code provided are aimed at being as general-purpose and portable as possible, our reference setup, when talking about a local cluster, will be Cloudera running atop CentOS Linux.

For the most part, we will show examples that make use of, or are executed from, a terminal prompt. Although Hadoop's graphical interfaces have improved significantly over the years (for example, the excellent HUE and Cloudera Manager), when it comes to development, automation, and programmatic access to the system, the command line is still the most powerful tool for the job.

All examples and source code presented in this book can be downloaded from https://github.com/learninghadoop2/book-examples. In addition, we have a homepage for the book where we will publish updates and related material at http://learninghadoop2.com.


Cloudera QuickStart VM
One of the advantages of Hadoop distributions is that they give access to easy-to-install, packaged software. Cloudera takes this one step further and provides a freely downloadable Virtual Machine instance of its latest distribution, known as the CDH QuickStart VM, deployed on top of CentOS Linux.

In the remaining parts of this book, we will use the CDH 5.0.0 VM as the reference and baseline system to run examples and source code. Images of the VM are available for the VMware (http://www.vmware.com/nl/products/player/), KVM (http://www.linux-kvm.org/page/Main_Page), and VirtualBox (https://www.virtualbox.org/) virtualization systems.


Amazon EMR
Before using Elastic MapReduce, we need to set up an AWS account and register it with the necessary services.

Creating an AWS account
Amazon has integrated its general accounts with AWS, which means that, if you already have an account for any of the Amazon retail websites, this is the only account you will need to use AWS services.

Note
Note that AWS services have a cost; you will need an active credit card associated with the account to which charges can be made.

If you require a new Amazon account, go to http://aws.amazon.com, select Create a new AWS account, and follow the prompts. Amazon has added a free tier for some services, so you might find that in the early days of testing and exploration, you are keeping many of your activities within the non-charged tier. The scope of the free tier has been expanding, so make sure you know what you will and won't be charged for.

Signing up for the necessary services
Once you have an Amazon account, you will need to register it for use with the required AWS services, that is, Simple Storage Service (S3), Elastic Compute Cloud (EC2), and Elastic MapReduce. There is no cost to simply sign up to any AWS service; the process just makes the service available to your account.

Go to the S3, EC2, and EMR pages linked from http://aws.amazon.com, click on the Sign up button on each page, and then follow the prompts.


Using Elastic MapReduce
Having created an account with AWS and registered all the required services, we can proceed to configure programmatic access to EMR.


Getting Hadoop up and running
Note
Caution! This costs real money!

Before going any further, it is critical to understand that use of AWS services will incur charges that will appear on the credit card associated with your Amazon account. Most of the charges are quite small and increase with the amount of infrastructure consumed; storing 10 GB of data in S3 costs 10 times more than 1 GB, and running 20 EC2 instances costs 20 times as much as a single one. There are tiered cost models, so the actual costs tend to have smaller marginal increases at higher levels. But you should read carefully through the pricing sections for each service before using any of them. Note also that currently data transfer out of AWS services, such as EC2 and S3, is chargeable, but data transfer between services is not. This means it is often most cost-effective to carefully design your use of AWS to keep data within AWS through as much of the data processing as possible. For information regarding AWS and EMR, consult http://aws.amazon.com/elasticmapreduce/#pricing.

How to use EMR
Amazon provides both web and command-line interfaces to EMR. Both interfaces are just a frontend to the very same system; a cluster created with the command-line interface can be inspected and managed with the web tools, and vice versa.

For the most part, we will be using the command-line tools to create and manage clusters programmatically and will fall back on the web interface in cases where it makes sense to do so.

AWS credentials
Before using either programmatic or command-line tools, we need to look at how an account holder authenticates to AWS to make such requests.

Each AWS account has several identifiers, such as the following, that are used when accessing the various services:

Account ID: each AWS account has a numeric ID.
Access key: the associated access key is used to identify the account making the request.
Secret access key: the partner to the access key is the secret access key. The access key is not a secret and could be exposed in service requests, but the secret access key is what you use to validate yourself as the account owner. Treat it like your credit card.
Key pairs: these are the key pairs used to log in to EC2 hosts. It is possible to either generate public/private key pairs within EC2 or to import externally generated keys into the system.


User credentials and permissions are managed via a web service called Identity and Access Management (IAM), which you need to sign up to in order to obtain access and secret keys.

If this sounds confusing, it's because it is, at least at first. When using a tool to access an AWS service, there's usually the single, upfront step of adding the right credentials to a configuration file, and then everything just works. However, if you do decide to explore programmatic or command-line tools, it will be worth investing a little time to read the documentation for each service to understand how its security works. More information on creating an AWS account and obtaining access credentials can be found at http://docs.aws.amazon.com/iam.


The AWS command-line interface
Each AWS service historically had its own set of command-line tools. Recently, though, Amazon has created a single, unified command-line tool that allows access to most services. The Amazon CLI can be found at http://aws.amazon.com/cli.

It can be installed from a tarball or via the pip or easy_install package managers.

On the CDH QuickStart VM, we can install awscli using the following command:

$ pip install awscli

In order to access the API, we need to configure the software to authenticate to AWS using our access and secret keys.

This is also a good moment to set up an EC2 key pair by following the instructions provided at https://console.aws.amazon.com/ec2/home?region=us-east-1#c=EC2&s=KeyPairs.

Although a key pair is not strictly necessary to run an EMR cluster, it will give us the capability to remotely log in to the master node and gain low-level access to the cluster.

The following command will guide you through a series of configuration steps and store the resulting configuration in the .aws/credentials file:

$ aws configure
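The configure command prompts for the two keys, a default region, and an output format. The session and the file it writes look roughly like the following; the key values shown are the standard AWS documentation placeholders rather than real credentials, and the exact prompt wording can vary between CLI versions:

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: json

$ cat ~/.aws/credentials
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY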

Once the CLI is configured, we can query AWS with aws <service> <arguments>. To create and query an S3 bucket, use something like the following commands. Note that S3 bucket names need to be globally unique across all AWS accounts, so most common names, such as s3://mybucket, will not be available:

$ aws s3 mb s3://learninghadoop2

$ aws s3 ls

We can provision an EMR cluster with five m1.xlarge nodes using the following command:

$ aws emr create-cluster --name "EMR cluster" \
--ami-version 3.2.0 \
--instance-type m1.xlarge \
--instance-count 5 \
--log-uri s3://learninghadoop2/emr-logs

Here, --ami-version is the ID of an Amazon Machine Image template (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html), and --log-uri instructs EMR to collect logs and store them in the learninghadoop2 S3 bucket.

Note
If you did not specify a default region when setting up the AWS CLI, then you will also have to add one to most EMR commands using the --region argument; for example, --region eu-west-1 to use the EU (Ireland) region. You can find details of all available AWS regions at http://docs.aws.amazon.com/general/latest/gr/rande.html.

We can submit workflows by adding steps to a running cluster using the following command:

$ aws emr add-steps --cluster-id <cluster> --steps <steps>

To terminate the cluster, use the following command line:

$ aws emr terminate-clusters --cluster-ids <cluster>

In later chapters, we will show you how to add steps to execute MapReduce jobs and Pig scripts.
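As a small preview, a single step that runs a custom JAR already uploaded to S3 can be added with a command along the following lines; the cluster ID, JAR path, and arguments here are placeholders rather than values used elsewhere in this book, and the full shorthand syntax is described in the EMR CLI documentation:

$ aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
--steps Type=CUSTOM_JAR,Name=ExampleStep,ActionOnFailure=CONTINUE,Jar=s3://learninghadoop2/jars/example.jar,Args=[arg1,arg2]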

More information on using the AWS CLI can be found at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-manage.html.


Running the examples
The source code of all examples is available at https://github.com/learninghadoop2/book-examples.

Gradle (http://www.gradle.org/) scripts and configurations are provided to compile most of the Java code. The gradlew script included with the examples will bootstrap Gradle and use it to fetch dependencies and compile code.

JAR files can be created by invoking the jar task via the gradlew script, as follows:

./gradlew jar

Jobs are usually executed by submitting a JAR file using the hadoop jar command, as follows:

$ hadoop jar example.jar <MainClass> [-libjars $LIBJARS] arg1 arg2 … argN

The optional -libjars parameter specifies runtime third-party dependencies to ship to remote nodes.

Note
Some of the frameworks we will work with, such as Apache Spark, come with their own build and package management tools. Additional information and resources will be provided for these particular cases.

The copyJar Gradle task can be used to download third-party dependencies into build/libjars/<example>/lib, as follows:

./gradlew copyJar

For convenience, we provide a fatJar Gradle task that bundles the example classes and their dependencies into a single JAR file. Although this approach is discouraged in favor of using -libjars, it might come in handy when dealing with dependency issues.

The following command will generate build/libs/<example>-all.jar:

$ ./gradlew fatJar


Data processing with Hadoop
In the remaining chapters of this book, we will introduce the core components of the Hadoop ecosystem as well as a number of third-party tools and libraries that will make writing robust, distributed code an accessible and hopefully enjoyable task. While reading this book, you will learn how to collect, process, store, and extract information from large amounts of structured and unstructured data.

We will use a dataset generated from Twitter's (http://www.twitter.com) real-time firehose. This approach will allow us to experiment with relatively small datasets locally and, once ready, scale the examples up to production-level data sizes.


Why Twitter?
Thanks to its programmatic APIs, Twitter provides an easy way to generate datasets of arbitrary size and inject them into our local or cloud-based Hadoop clusters. Other than the sheer size, the dataset that we will use has a number of properties that fit several interesting data modeling and processing use cases.

Twitter data possesses the following properties:

Unstructured: each status update is a text message that can contain references to media content such as URLs and images
Structured: tweets are timestamped, sequential records
Graph: relationships such as replies and mentions can be modeled as a network of interactions
Geolocated: the location where a tweet was posted or where a user resides
Real time: all data generated on Twitter is available via a real-time firehose

These properties will be reflected in the type of application that we can build with Hadoop. These include examples of sentiment analysis, social network analysis, and trend analysis.


Building our first dataset
Twitter's terms of service prohibit redistribution of user-generated data in any form; for this reason, we cannot make available a common dataset. Instead, we will use a Python script to programmatically access the platform and create a dump of user tweets collected from a live stream.

One service, multiple APIs
Twitter users share more than 200 million tweets, also known as status updates, a day. The platform offers access to this corpus of data via four types of APIs, each of which represents a facet of Twitter and aims at satisfying specific use cases, such as linking and interacting with Twitter content from third-party sources (Twitter for Products), programmatic access to specific users' or sites' content (REST), search capabilities across users' or sites' timelines (Search), and access to all content created on the Twitter network in real time (Streaming).

The Streaming API allows direct access to the Twitter stream, tracking keywords, retrieving geotagged tweets from a certain region, and much more. In this book, we will make use of this API as a data source to illustrate both the batch and real-time capabilities of Hadoop. We will not, however, interact with the API itself; rather, we will make use of third-party libraries to offload chores such as authentication and connection management.

Anatomy of a Tweet
Each tweet object returned by a call to the real-time APIs is represented as a serialized JSON string that contains a set of attributes and metadata in addition to a textual message. This additional content includes a numerical ID that uniquely identifies the tweet, the location where the tweet was shared, the user who shared it (the user object), whether it was republished by other users (retweeted) and how many times (retweet count), the machine-detected language of its text, whether the tweet was posted in reply to someone and, if so, the user and tweet IDs it replied to, and so on.

The structure of a Tweet, and any other object exposed by the API, is constantly evolving. An up-to-date reference can be found at https://dev.twitter.com/docs/platform-objects/tweets.

Twitter credentials
Twitter makes use of the OAuth protocol to authenticate and authorize access from third-party software to its platform.

The application obtains, through an external channel such as a web form, the following pair of credentials:

Consumer key
Consumer secret

The consumer secret is never directly transmitted to the third party, as it is used to sign each request.

The user authorizes the application to access the service via a three-way process that, once completed, grants the application a token consisting of the following:

Access token
Access secret

Similarly to the consumer secret, the access secret is never directly transmitted to the third party, and it is used to sign each request.

In order to use the Streaming API, we will first need to register an application and grant it programmatic access to the system. If you require a new Twitter account, proceed to the signup page at https://twitter.com/signup, and fill in the required information. Once this step is completed, we need to create a sample application that will access the API on our behalf and grant it the proper authorization rights. We will do so using the web form found at https://dev.twitter.com/apps.

When creating a new app, we are asked to give it a name, a description, and a URL. The following screenshot shows the settings of a sample application named Learning Hadoop 2 Book Dataset. For the purpose of this book, we do not need to specify a valid URL, so we used a placeholder instead.

Once the form is filled in, we need to review and accept the terms of service and click on the Create Application button in the bottom-left corner of the page.

We are now presented with a page that summarizes our application details, as seen in the following screenshot; the authentication and authorization credentials can be found under the OAuth Tool tab.

We are finally ready to generate our very first Twitter dataset.


Programmatic access with Python
In this section, we will use Python and the tweepy library, found at https://github.com/tweepy/tweepy, to collect Twitter's data. The stream.py file found in the ch1 directory of the book code archive instantiates a listener to the real-time firehose, grabs a data sample, and echoes each tweet's text to standard output.

The tweepy library can be installed using either the easy_install or pip package managers or by cloning the repository at https://github.com/tweepy/tweepy.

On the CDH QuickStart VM, we can install tweepy using the following command line:

$ pip install tweepy

When invoked with the -j parameter, the script will output a JSON tweet to standard output; -t extracts and prints the text field. We specify how many tweets to print with -n <num tweets>. When -n is not specified, the script will run indefinitely. Execution can be terminated by pressing Ctrl + C.

The script expects OAuth credentials to be stored as shell environment variables; the following credentials will have to be set in the terminal session from where stream.py will be executed:

$ export TWITTER_CONSUMER_KEY="your_consumer_key"

$ export TWITTER_CONSUMER_SECRET="your_consumer_secret"

$ export TWITTER_ACCESS_KEY="your_access_key"

$ export TWITTER_ACCESS_SECRET="your_access_secret"

Once the required dependency has been installed and the OAuth data in the shell environment has been set, we can run the program as follows:

$ python stream.py -t -n 1000 > tweets.txt

We are relying on Linux's shell I/O to redirect the output of stream.py to a file called tweets.txt with the > operator. If everything was executed correctly, you should see a wall of text, where each line is a tweet.

Notice that in this example, we did not make use of Hadoop at all. In the next chapters, we will show how to import a dataset generated from the Streaming API into Hadoop and analyze its contents on the local cluster and on Amazon EMR.

For now, let's take a look at the source code of stream.py, which can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch1/stream.py:

import tweepy
import os
import json
import argparse

consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_key = os.environ['TWITTER_ACCESS_KEY']
access_secret = os.environ['TWITTER_ACCESS_SECRET']


class EchoStreamListener(tweepy.StreamListener):
    def __init__(self, api, dump_json=False, numtweets=0):
        self.api = api
        self.dump_json = dump_json
        self.count = 0
        self.limit = int(numtweets)
        super(tweepy.StreamListener, self).__init__()

    def on_data(self, tweet):
        tweet_data = json.loads(tweet)
        if 'text' in tweet_data:
            if self.dump_json:
                print tweet.rstrip()
            else:
                print tweet_data['text'].encode("utf-8").rstrip()
            self.count = self.count + 1
            return False if self.count == self.limit else True

    def on_error(self, status_code):
        return True

    def on_timeout(self):
        return True

if __name__ == '__main__':
    parser = get_parser()
    args = parser.parse_args()

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    sapi = tweepy.streaming.Stream(
        auth, EchoStreamListener(
            api=api,
            dump_json=args.json,
            numtweets=args.numtweets))
    sapi.sample()

First, we import our dependencies: the tweepy library, plus the os, json, and argparse modules, which come with the Python interpreter version 2.6 or greater.

We then define a class, EchoStreamListener, that inherits and extends StreamListener from tweepy. As the name suggests, StreamListener listens for events and tweets being published on the real-time stream and performs actions accordingly.

Whenever a new event is detected, it triggers a call to on_data(). In this method, we extract the text field from a tweet object and print it to standard output with UTF-8 encoding. Alternatively, if the script is invoked with -j, we print the whole JSON tweet. When the script is executed, we instantiate a tweepy.OAuthHandler object with the OAuth credentials that identify our Twitter account, and then we use this object to authenticate with the application access and secret key. We then use the auth object to create an instance of the tweepy.API class (api).


Upon successful authentication, we tell Python to listen for events on the real-time stream using EchoStreamListener.

An HTTP GET request to the statuses/sample endpoint is performed by sample(). The request returns a random sample of all public statuses.

Note
Beware! By default, sample() will run indefinitely. Remember to explicitly terminate the method call by pressing Ctrl + C.
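One piece not shown in the listing is the get_parser() helper called at the start of the main block; it simply builds the argparse parser for the flags discussed earlier. A minimal sketch of what such a helper might look like follows (the exact option names and defaults in the book's repository may differ):

def get_parser():
    # Hypothetical sketch of the command-line parser used by stream.py.
    parser = argparse.ArgumentParser(description='Sample the Twitter firehose')
    parser.add_argument('-j', '--json', action='store_true',
                        help='print the full JSON of each tweet')
    parser.add_argument('-t', '--text', action='store_true',
                        help='print only the text field of each tweet')
    parser.add_argument('-n', '--numtweets', default=0,
                        help='number of tweets to fetch before exiting (0 means run indefinitely)')
    return parser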


Summary
This chapter gave a whirlwind tour of where Hadoop came from, its evolution, and why the version 2 release is such a major milestone. We also described the emerging market in Hadoop distributions and how we will use a combination of local and cloud distributions in the book.

Finally, we described how to set up the needed software, accounts, and environments required in subsequent chapters and demonstrated how to pull data from the Twitter stream that we will use for examples.

With this background out of the way, we will now move on to a detailed examination of the storage layer within Hadoop.


Chapter 2. Storage
After the overview of Hadoop in the previous chapter, we will now start looking at its various component parts in more detail. We will start at the conceptual bottom of the stack in this chapter: the means and mechanisms for storing data within Hadoop. In particular, we will discuss the following topics:

Describe the architecture of the Hadoop Distributed File System (HDFS)
Show what enhancements to HDFS have been made in Hadoop 2
Explore how to access HDFS using command-line tools and the Java API
Give a brief description of ZooKeeper, another (sort of) filesystem within Hadoop
Survey considerations for storing data in Hadoop and the available file formats

In Chapter 3, Processing – MapReduce and Beyond, we will describe how Hadoop provides the framework to allow data to be processed.


The inner workings of HDFS
In Chapter 1, Introduction, we gave a very high-level overview of HDFS; we will now explore it in a little more detail. As mentioned in that chapter, HDFS can be viewed as a filesystem, though one with very specific performance characteristics and semantics. It's implemented with two main server processes: the NameNode and the DataNodes, configured in a master/slave setup. If you view the NameNode as holding all the filesystem metadata and the DataNodes as holding the actual filesystem data (blocks), then this is a good starting point. Every file placed onto HDFS will be split into multiple blocks that might reside on numerous DataNodes, and it's the NameNode that understands how these blocks can be combined to construct the files.


Cluster startup
Let's explore the various responsibilities of these nodes and the communication between them by assuming we have an HDFS cluster that was previously shut down and then examining the startup behavior.

NameNode startup
We'll firstly consider the startup of the NameNode (though there is no actual ordering requirement for this, and we are doing it for narrative reasons alone). The NameNode actually stores two types of data about the filesystem:

The structure of the filesystem, that is, directory names, filenames, locations, and attributes
The blocks that comprise each file on the filesystem

This data is stored in files that the NameNode reads at startup. Note that the NameNode does not persistently store the mapping of the blocks that are stored on particular DataNodes; we'll see how that information is communicated shortly.

Because the NameNode relies on this in-memory representation of the filesystem, it tends to have quite different hardware requirements compared to the DataNodes. We'll explore hardware selection in more detail in Chapter 10, Running a Hadoop Cluster; for now, just remember that the NameNode tends to be quite memory hungry. This is particularly true on very large clusters with many (millions or more) files, particularly if these files have very long names. This scaling limitation on the NameNode has also led to an additional Hadoop 2 feature that we will not explore in much detail: NameNode federation, whereby multiple NameNodes (or NameNode HA pairs) work collaboratively to provide the overall metadata for the full filesystem.

The main file written by the NameNode is called fsimage; this is the single most important piece of data in the entire cluster, as without it, the knowledge of how to reconstruct all the data blocks into the usable filesystem is lost. This file is read into memory and all future modifications to the filesystem are applied to this in-memory representation of the filesystem. The NameNode does not write out new versions of fsimage as new changes are applied while it runs; instead, it writes another file called edits, which is a list of the changes that have been made since the last version of fsimage was written.

The NameNode startup process is to first read the fsimage file, then read the edits file, and apply all the changes stored in the edits file to the in-memory copy of fsimage. It then writes to disk a new, up-to-date version of the fsimage file and is ready to receive client requests.

DataNode startup
When the DataNodes start up, they first catalog the blocks for which they hold copies. Typically, these blocks will be written simply as files on the local DataNode filesystem.


The DataNode will perform some block consistency checking and then report to the NameNode the list of blocks for which it has valid copies. This is how the NameNode constructs the final mapping it requires: by learning which blocks are stored on which DataNodes. Once the DataNode has registered itself with the NameNode, an ongoing series of heartbeat requests will be sent between the nodes to allow the NameNode to detect DataNodes that have shut down, become unreachable, or have newly entered the cluster.


Block replication
HDFS replicates each block onto multiple DataNodes; the default replication factor is 3, but this is configurable on a per-file level. HDFS can also be configured to be able to determine whether given DataNodes are in the same physical hardware rack or not. Given smart block placement and this knowledge of the cluster topology, HDFS will attempt to place the second replica on a different host but in the same equipment rack as the first, and the third on a host outside the rack. In this way, the system can survive the failure of as much as a full rack of equipment and still have at least one live replica for each block. As we'll see in Chapter 3, Processing – MapReduce and Beyond, knowledge of block placement also allows Hadoop to schedule processing as near as possible to a replica of each block, which can greatly improve performance.
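As a quick illustration of per-file replication, the replication factor of an existing file can be inspected and changed from the command line; the path below is purely a placeholder, and the -w flag makes the command wait until re-replication completes:

$ hdfs dfs -setrep -w 2 /user/cloudera/somefile.txt
$ hdfs dfs -ls /user/cloudera/somefile.txt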

Remember that replication is a strategy for resilience, but it is not a backup mechanism; if you have data mastered in HDFS that is critical, then you need to consider backup or other approaches that give protection against errors, such as accidentally deleted files, against which replication will not defend.

When the NameNode starts up and is receiving the block reports from the DataNodes, it will remain in safe mode until a configurable threshold of blocks (the default is 99.9 percent) have been reported as live. While in safe mode, clients cannot make any modifications to the filesystem.


Command-line access to the HDFS filesystem
Within the Hadoop distribution, there is a command-line utility called hdfs, which is the primary way to interact with the filesystem from the command line. Run this without any arguments to see the various subcommands available. There are many, though; several are used to do things like starting or stopping various HDFS components. The general form of the hdfs command is:

hdfs <sub-command> <command> [arguments]

The two main subcommands we will use in this book are:

dfs: This is used for general filesystem access and manipulation, including reading/writing and accessing files and directories
dfsadmin: This is used for administration and maintenance of the filesystem. We will not cover this command in detail, though. Have a look at the -report command, which gives a listing of the state of the filesystem and all DataNodes:

$ hdfs dfsadmin -report

Note
Note that the dfs and dfsadmin commands can also be used with the main Hadoop command-line utility, for example, hadoop fs -ls /. This was the approach in earlier versions of Hadoop but is now deprecated in favor of the hdfs command.


Exploring the HDFS filesystem
Run the following to get a list of the available commands provided by the dfs subcommand:

$ hdfs dfs

As will be seen from the output of the preceding command, many of these look similar to standard Unix filesystem commands and, not surprisingly, they work as would be expected. In our test VM, we have a user account called cloudera. Using this user, we can list the root of the filesystem as follows:

$ hdfs dfs -ls /

Found 7 items
drwxr-xr-x   - hbase hbase               0 2014-04-04 15:18 /hbase
drwxr-xr-x   - hdfs  supergroup          0 2014-10-21 13:16 /jar
drwxr-xr-x   - hdfs  supergroup          0 2014-10-15 15:26 /schema
drwxr-xr-x   - solr  solr                0 2014-04-04 15:16 /solr
drwxrwxrwt   - hdfs  supergroup          0 2014-11-12 11:29 /tmp
drwxr-xr-x   - hdfs  supergroup          0 2014-07-13 09:05 /user
drwxr-xr-x   - hdfs  supergroup          0 2014-04-04 15:15 /var

The output is very similar to the Unix ls command. The file attributes work the same as the user/group/world attributes on a Unix filesystem (including the t sticky bit, as can be seen), plus details of the owner, group, and modification time of the directories. The column between the group name and the modified date is the size; this is 0 for directories but will have a value for files, as we'll see in the code following the next information box.

Note
If relative paths are used, they are taken from the home directory of the user. If there is no home directory, we can create it using the following commands:

$ sudo -u hdfs hdfs dfs -mkdir /user/cloudera

$ sudo -u hdfs hdfs dfs -chown cloudera:cloudera /user/cloudera

The mkdir and chown steps require superuser privileges (sudo -u hdfs).

$ hdfs dfs -mkdir testdir

$ hdfs dfs -ls

Found 1 items
drwxr-xr-x   - cloudera cloudera          0 2014-11-13 11:21 testdir

Then, we can create a file, copy it to HDFS, and read its contents directly from its location on HDFS, as follows:

$ echo "Hello world" > testfile.txt

$ hdfs dfs -put testfile.txt testdir

Note that there is an older command called -copyFromLocal, which works in the same way as -put; you might see it in older documentation online. Now, run the following command and check the output:


$ hdfs dfs -ls testdir

Found 1 items
-rw-r--r--   3 cloudera cloudera         12 2014-11-13 11:21 testdir/testfile.txt

Note the new column between the file attributes and the owner; this is the replication factor of the file. Now, finally, run the following command:

$ hdfs dfs -tail testdir/testfile.txt

Hello world

Much of the rest of the dfs subcommands are pretty intuitive; play around. We'll explore snapshots and programmatic access to HDFS later in this chapter.


Protecting the filesystem metadata
Because the fsimage file is so critical to the filesystem, its loss is a catastrophic failure. In Hadoop 1, where the NameNode was a single point of failure, the best practice was to configure the NameNode to synchronously write the fsimage and edits files to both local storage plus at least one other location on a remote filesystem (often NFS). In the event of NameNode failure, a replacement NameNode could be started using this up-to-date copy of the filesystem metadata. The process would require non-trivial manual intervention, however, and would result in a period of complete cluster unavailability.


Secondary NameNode not to the rescue
The most unfortunately named component in all of Hadoop 1 was the Secondary NameNode, which, not unreasonably, many people expect to be some sort of backup or standby NameNode. It is not; instead, the Secondary NameNode was responsible only for periodically reading the latest version of the fsimage and edits files and creating a new up-to-date fsimage with the outstanding edits applied. On a busy cluster, this checkpoint could significantly speed up the restart of the NameNode by reducing the number of edits it had to apply before being able to service clients.

In Hadoop 2, the naming is clearer; there are Checkpoint nodes, which perform the role previously played by the Secondary NameNode, plus Backup NameNodes, which keep a local up-to-date copy of the filesystem metadata, even though the process to promote a Backup node to be the primary NameNode is still a multistage manual process.


Hadoop 2 NameNode HA
In most production Hadoop 2 clusters, however, it makes more sense to use the full High Availability (HA) solution instead of relying on Checkpoint and Backup nodes. It is actually an error to try to combine NameNode HA with the Checkpoint and Backup node mechanisms.

The core idea is for a pair (currently no more than two are supported) of NameNodes configured in an active/passive cluster. One NameNode acts as the live master that services all client requests, and the second remains ready to take over should the primary fail. In particular, Hadoop 2 HDFS enables this HA through two mechanisms:

Providing a means for both NameNodes to have consistent views of the filesystem
Providing a means for clients to always connect to the master NameNode

Keeping the HA NameNodes in sync
There are actually two mechanisms by which the active and standby NameNodes keep their views of the filesystem consistent: use of an NFS share or the Quorum Journal Manager (QJM).

In the NFS case, there is an obvious requirement for an external remote NFS file share; note that, as use of NFS was best practice in Hadoop 1 for a second copy of the filesystem metadata, many clusters already have one. If high availability is a concern, though, it should be borne in mind that making NFS highly available often requires high-end and expensive hardware. When Hadoop 2 HA uses NFS, the NFS location becomes the primary location for the filesystem metadata. As the active NameNode writes all filesystem changes to the NFS share, the standby node detects these changes and updates its copy of the filesystem metadata accordingly.

The QJM mechanism uses an external service (the Journal Managers) instead of a filesystem. The Journal Manager cluster is an odd number of services (3, 5, and 7 are the most common) running on that number of hosts. All changes to the filesystem are submitted to the QJM service, and a change is treated as committed only when a majority of the QJM nodes have committed the change. The standby NameNode receives change updates from the QJM service and uses this information to keep its copy of the filesystem metadata up to date.

The QJM mechanism does not require additional hardware, as the Journal Manager services are lightweight and can be co-located with other services. There is also no single point of failure in the model. Consequently, QJM-based HA is usually the preferred option.

In either case, both in NFS-based HA and QJM-based HA, the DataNodes send block status reports to both NameNodes to ensure that both have up-to-date information of the mapping of blocks to DataNodes. Remember that this block assignment information is not held in the fsimage/edits data.


Client configuration
The clients to the HDFS cluster remain mostly unaware of the fact that NameNode HA is being used. The configuration files need to include the details of both NameNodes, but the mechanisms for determining which is the active NameNode, and when to switch to the standby, are fully encapsulated in the client libraries. The fundamental concept, though, is that instead of referring to an explicit NameNode host as in Hadoop 1, HDFS in Hadoop 2 identifies a nameservice ID for the NameNode, within which multiple individual NameNodes (each with its own NameNode ID) are defined for HA. Note that the concept of nameservice ID is also used by NameNode federation, which we briefly mentioned earlier.
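To give a flavor of what this looks like, the following is a minimal, hypothetical fragment of hdfs-site.xml defining a nameservice called mycluster with two NameNodes, nn1 and nn2, kept in sync through a QJM ensemble; all hostnames are placeholders, and a real deployment needs further properties (the client failover proxy provider, fencing methods, and so on):

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>

Clients then refer to the filesystem as hdfs://mycluster rather than to a specific NameNode host.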


How a failover works
Failover can be either manual or automatic. A manual failover requires an administrator to trigger the switch that promotes the standby to the currently active NameNode. Though automatic failover has the greatest impact on maintaining system availability, there might be conditions in which it is not always desirable. Triggering a manual failover requires running only a few commands and, therefore, even in this mode, the failover is significantly easier than in the case of Hadoop 1 or with Hadoop 2 Backup nodes, where the transition to a new NameNode requires substantial manual effort.

Regardless of whether the failover is triggered manually or automatically, it has two main phases: confirmation that the previous master is no longer serving requests, and the promotion of the standby to be the master.

The greatest risk in a failover is to have a period in which both NameNodes are servicing requests. In such a situation, it is possible that conflicting changes might be made to the filesystem on the two NameNodes or that they might become out of sync. Even though this should not be possible if the QJM is being used (it only ever accepts connections from a single client), out-of-date information might be served to clients, who might then try to make incorrect decisions based on this stale metadata. This is, of course, particularly likely if the previous master NameNode is behaving incorrectly in some way, which is why the need for the failover is identified in the first place.

To ensure only one NameNode is active at any time, a fencing mechanism is used to validate that the existing NameNode master has been shut down. The simplest included mechanism will try to ssh into the NameNode host and actively kill the process, though a custom script can also be executed, so the mechanism is flexible. The failover will not continue until the fencing is successful and the system has confirmed that the previous master NameNode is now dead and has released any required resources.

Once fencing succeeds, the standby NameNode becomes the master and will start writing to the NFS-mounted fsimage and edits logs if NFS is being used for HA, or will become the single client to the QJM if that is the HA mechanism.

Before discussing automatic failover, we need a slight segue to introduce another Apache project that is used to enable this feature.


Apache ZooKeeper – a different type of filesystem
Within Hadoop, we will mostly talk about HDFS when discussing filesystems and data storage. But inside almost all Hadoop 2 installations, there is another service that looks somewhat like a filesystem, but which provides significant capability crucial to the proper functioning of distributed systems. This service is Apache ZooKeeper (http://zookeeper.apache.org) and, as it is a key part of the implementation of HDFS HA, we will introduce it in this chapter. It is, however, also used by multiple other Hadoop components and related projects, so we will touch on it several more times throughout the book.

ZooKeeper started out as a subcomponent of HBase and was used to enable several operational capabilities of the service. When any complex distributed system is built, there are a series of activities that are almost always required and which are always difficult to get right. These activities include things such as handling shared locks, detecting component failure, and supporting leader election within a group of collaborating services. ZooKeeper was created as the coordination service that would provide a series of primitive operations upon which HBase could implement these types of operationally critical features. Note that ZooKeeper also takes inspiration from the Google Chubby system described at http://research.google.com/archive/chubby-osdi06.pdf.

ZooKeeper runs as a cluster of instances referred to as an ensemble. The ensemble provides a data structure that is somewhat analogous to a filesystem. Each location in the structure is called a ZNode and can have children, as if it were a directory, but can also have content, as if it were a file. Note that ZooKeeper is not a suitable place to store very large amounts of data, and by default, the maximum amount of data in a ZNode is 1 MB. At any point in time, one server in the ensemble is the master and makes all decisions about client requests. There are very well-defined rules around the responsibilities of the master, including that it has to ensure that a request is only committed when a majority of the ensemble have committed the change, and that once committed any conflicting change is rejected.

You should have ZooKeeper installed within your Cloudera Virtual Machine. If not, use Cloudera Manager to install it as a single node on the host. In production systems, ZooKeeper has very specific semantics around absolute majority voting, so some of the logic only makes sense in a larger ensemble (3, 5, or 7 nodes are the most common sizes).

There is a command-line client to ZooKeeper called zookeeper-client in the Cloudera VM; note that in the vanilla ZooKeeper distribution it is called zkCli.sh. If you run it with no arguments, it will connect to the ZooKeeper server running on the local machine. From here, you can type help to get a list of commands.

The most immediately interesting commands will be create, ls, and get. As the names suggest, these create a ZNode, list the ZNodes at a particular point in the filesystem, and get the data stored at a particular ZNode. Here are some examples of usage.

Create a ZNode with no data:

$ create /zk-test ''

Create a child of the first ZNode and store some text in it:

$ create /zk-test/child1 'sample data'

Retrieve the data associated with a particular ZNode:

$ get /zk-test/child1

The client can also register a watcher on a given ZNode; this will raise an alert if the ZNode in question changes, that is, if either its data or its children are modified.

This might not sound very useful, but ZNodes can additionally be created as both sequential and ephemeral nodes, and this is where the magic starts.


Implementing a distributed lock with sequential ZNodes
If a ZNode is created within the CLI with the -s option, it will be created as a sequential node. ZooKeeper will suffix the supplied name with a 10-digit integer guaranteed to be unique and greater than any other sequential children of the same ZNode. We can use this mechanism to create a distributed lock. ZooKeeper itself is not holding the actual lock; the client needs to understand what particular states in ZooKeeper mean in terms of their mapping to the application locks in question.

If we create a (non-sequential) ZNode at /zk-lock, then any client wishing to hold the lock will create a sequential child node. For example, the create -s /zk-lock/locknode command might create the node /zk-lock/locknode-0000000001 in the first case, with increasing integer suffixes for subsequent calls. When a client creates a ZNode under the lock, it will then check if its sequential node has the lowest integer suffix. If it does, then it is treated as having the lock. If not, then it will need to wait until the node holding the lock is deleted. The client will usually put a watch on the node with the next lowest suffix and then be alerted when that node is deleted, indicating that it now holds the lock.
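A quick, hedged illustration in the zookeeper-client shell (the counter values shown are examples; the exact suffixes will differ on your system):

$ create /zk-lock ''
$ create -s /zk-lock/locknode- ''
Created /zk-lock/locknode-0000000000
$ create -s /zk-lock/locknode- ''
Created /zk-lock/locknode-0000000001
$ ls /zk-lock
[locknode-0000000000, locknode-0000000001]

The client that created locknode-0000000000 holds the lock; the second client watches that node and takes the lock when it is deleted.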


Implementing group membership and leader election using ephemeral ZNodes
Any ZooKeeper client will send heartbeats to the server throughout the session, showing that it is alive. For the ZNodes we have discussed until now, we can say that they are persistent and will survive across sessions. We can, however, create a ZNode as ephemeral, meaning it will disappear once the client that created it either disconnects or is detected as being dead by the ZooKeeper server. Within the CLI, an ephemeral ZNode is created by adding the -e flag to the create command.

Ephemeral ZNodes are a good mechanism to implement group membership discovery within a distributed system. For any system where nodes can fail, join, and leave without notice, knowing which nodes are alive at any point in time is often a difficult task. Within ZooKeeper, we can provide the basis for such discovery by having each node create an ephemeral ZNode at a certain location in the ZooKeeper filesystem. The ZNodes can hold data about the service nodes, such as hostname, IP address, port number, and so on. To get a list of live nodes, we can simply list the child nodes of the parent group ZNode. Because of the nature of ephemeral nodes, we can have confidence that the list of live nodes retrieved at any time is up to date.

If we have each service node create ZNode children that are not just ephemeral but also sequential, then we can also build a mechanism for leader election for services that need to have a single master node at any one time. The mechanism is the same as for locks; the client service node creates the sequential and ephemeral ZNode and then checks if it has the lowest sequence number. If so, then it is the master. If not, then it will register a watcher on the next lowest sequence node to be alerted when it might become the master.
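A sketch of this pattern from the CLI, with each service registering itself under a hypothetical /services/workers group node (the ZNode data here is just an illustrative host:port string):

$ create /services ''
$ create /services/workers ''
$ create -s -e /services/workers/member- 'host1.example.com:9090'
Created /services/workers/member-0000000000
$ ls /services/workers
[member-0000000000]

When the client that created member-0000000000 disconnects, its ZNode disappears, and a fresh ls shows the updated membership.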


Java API
The org.apache.zookeeper.ZooKeeper class is the main programmatic client to access a ZooKeeper ensemble. Refer to the Javadocs for the full details, but the basic interface is relatively straightforward, with an obvious one-to-one correspondence to commands in the CLI. For example:

create: is equivalent to the CLI create
getChildren: is equivalent to the CLI ls
getData: is equivalent to the CLI get
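To make the correspondence concrete, the following minimal sketch (not code from the book's repository) connects to a local ZooKeeper server, creates a ZNode, and reads it back; a production client would register a meaningful Watcher and wait for the connection to be established before issuing requests:

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        // Connect to ZooKeeper on localhost with a 5-second session timeout
        // and a no-op watcher.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, new Watcher() {
            public void process(WatchedEvent event) { }
        });

        // Equivalent to the CLI: create /zk-test 'sample data'
        zk.create("/zk-test", "sample data".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Equivalent to the CLI: get /zk-test
        byte[] data = zk.getData("/zk-test", false, null);
        System.out.println(new String(data));

        // Equivalent to the CLI: ls /
        List<String> children = zk.getChildren("/", false);
        System.out.println(children);

        zk.close();
    }
}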


Building blocks
As can be seen, ZooKeeper provides a small number of well-defined operations with very strong semantic guarantees that can be built into higher-level services, such as the locks, group membership, and leader election we discussed earlier. It's best to think of ZooKeeper as a toolkit of well-engineered and reliable functions critical to distributed systems that can be built upon without having to worry about the intricacies of their implementation. The provided ZooKeeper interface is quite low-level, though, and there are a few higher-level interfaces emerging that provide more of a mapping of the low-level primitives into application-level logic. The Curator project (http://curator.apache.org/) is a good example of this.

ZooKeeper was used sparingly within Hadoop 1, but it's now quite ubiquitous. It's used by both MapReduce and HDFS for the high availability of their JobTracker and NameNode components. Hive and Impala, which we will explore later, use it to place locks on data tables that are being accessed by multiple concurrent jobs. Kafka, which we'll discuss in the context of Samza, uses ZooKeeper for node (broker in Kafka terminology) membership, leader election, and state management.


Further reading
We have not described ZooKeeper in much detail and have completely omitted aspects such as its ability to apply quotas and access control lists to ZNodes within the filesystem and the mechanisms to build callbacks. Our purpose here was to give enough of the details so that you would have some idea of how it is being used within the Hadoop services we explore in this book. For more information, consult the project home page.


Automatic NameNode failover
Now that we have introduced ZooKeeper, we can show how it is used to enable automatic NameNode failover.

Automatic NameNode failover introduces two new components to the system: a ZooKeeper quorum, and the ZooKeeper Failover Controller (ZKFC), which runs on each NameNode host. The ZKFC creates an ephemeral ZNode in ZooKeeper and holds this ZNode for as long as it detects the local NameNode to be alive and functioning correctly. It determines this by continuously sending simple health-check requests to the NameNode, and if the NameNode fails to respond correctly over a short period of time, the ZKFC will assume the NameNode has failed. If a NameNode machine crashes or otherwise fails, the ZKFC session in ZooKeeper will be closed and the ephemeral ZNode will also be automatically removed.

The ZKFC processes are also monitoring the ZNodes of the other NameNodes in the cluster. If the ZKFC on the standby NameNode host sees the existing master's ZNode disappear, it will assume the master has failed and will attempt a failover. It does this by trying to acquire the lock for the NameNode (through the protocol described in the ZooKeeper section) and, if successful, will initiate a failover through the same fencing/promotion mechanism described earlier.


HDFS snapshots
We mentioned earlier that HDFS replication alone is not a suitable backup strategy. In the Hadoop 2 filesystem, snapshots have been added, which bring another level of data protection to HDFS.

Filesystem snapshots have been used for some time across a variety of technologies. The basic idea is that it becomes possible to view the exact state of the filesystem at particular points in time. This is achieved by taking a copy of the filesystem metadata at the point the snapshot is made and making this available to be viewed in the future.

As changes to the filesystem are made, any change that would affect the snapshot is treated specially. For example, if a file that exists in the snapshot is deleted then, even though it will be removed from the current state of the filesystem, its metadata will remain in the snapshot, and the blocks associated with its data will remain on the filesystem, though not accessible through any view of the system other than the snapshot.

An example might illustrate this point. Say you have a filesystem containing the following files:

/data1 (5 blocks)

/data2 (10 blocks)

You take a snapshot and then delete the file /data2. If you view the current state of the filesystem, then only /data1 will be visible. If you examine the snapshot, you will see both files. Behind the scenes, all 15 blocks still exist, but only those associated with the un-deleted file /data1 are part of the current filesystem. The blocks for the file /data2 will be released only when the snapshot is itself removed; snapshots are read-only views.

Snapshots in Hadoop 2 can be applied at either the full filesystem level or only on particular paths. A path needs to be set as snapshottable, and note that you cannot have a path snapshottable if any of its children or parent paths are themselves snapshottable.

Let's take a simple example based on the directory we created earlier to illustrate the use of snapshots. The commands we are going to illustrate need to be executed with superuser privileges, which can be obtained with sudo -u hdfs.

First, use the dfsadmin subcommand of the hdfs CLI utility to enable snapshots of a directory, as follows:

$ sudo -u hdfs hdfs dfsadmin -allowSnapshot \
/user/cloudera/testdir

Allowing snapshot on testdir succeeded

Now, we create the snapshot and examine it; snapshots are available through the .snapshot subdirectory of the snapshottable directory. Note that the .snapshot directory will not be visible in a normal listing of the directory. Here's how we create a snapshot and examine it:

$ sudo -u hdfs hdfs dfs -createSnapshot \


/user/cloudera/testdir sn1

Created snapshot /user/cloudera/testdir/.snapshot/sn1

$ sudo -u hdfs hdfs dfs -ls \
/user/cloudera/testdir/.snapshot/sn1

Found 1 items
-rw-r--r--   1 cloudera cloudera         12 2014-11-13 11:21 /user/cloudera/testdir/.snapshot/sn1/testfile.txt

Now, we remove the test file from the main directory and verify that it is now empty:

$ sudo -u hdfs hdfs dfs -rm \
/user/cloudera/testdir/testfile.txt

14/11/13 13:13:51 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://localhost.localdomain:8020/user/cloudera/testdir/testfile.txt' to trash at: hdfs://localhost.localdomain:8020/user/hdfs/.Trash/Current

$ hdfs dfs -ls /user/cloudera/testdir

$

Note the mention of trash directories; by default, HDFS will copy any deleted files into a .Trash directory in the user's home directory, which helps to defend against slipping fingers. These files can be removed through hdfs dfs -expunge or will be automatically purged in 7 days by default.

Now, we examine the snapshot, where the now-deleted file is still available:

$ hdfs dfs -ls testdir/.snapshot/sn1

Found 1 items
drwxr-xr-x   - cloudera cloudera          0 2014-11-13 13:12 testdir/.snapshot/sn1

$ hdfs dfs -tail testdir/.snapshot/sn1/testfile.txt

Hello world

Then, we can delete the snapshot, freeing up any blocks held by it, as follows:

$ sudo -u hdfs hdfs dfs -deleteSnapshot \
/user/cloudera/testdir sn1

$ hdfs dfs -ls testdir/.snapshot

$

As can be seen, the files within a snapshot are fully available to be read and copied, providing access to the historical state of the filesystem at the point when the snapshot was made. Each directory can have up to 65,535 snapshots, and HDFS manages snapshots in such a way that they are quite efficient in terms of impact on normal filesystem operations. They are a great mechanism to use prior to any activity that might have adverse effects, such as trying a new version of an application that accesses the filesystem. If the new software corrupts files, the old state of the directory can be restored. If, after a period of validation, the software is accepted, then the snapshot can instead be deleted.


Hadoop filesystems
Until now, we referred to HDFS as the Hadoop filesystem. In reality, Hadoop has a rather abstract notion of filesystem. HDFS is only one of several implementations of the org.apache.hadoop.fs.FileSystem Java abstract class. A list of available filesystems can be found at https://hadoop.apache.org/docs/r2.5.0/api/org/apache/hadoop/fs/FileSystem.html. The following table summarizes some of these filesystems, along with the corresponding URI scheme and Java implementation class.

Filesystem         URI scheme   Java implementation
Local              file         org.apache.hadoop.fs.LocalFileSystem
HDFS               hdfs         org.apache.hadoop.hdfs.DistributedFileSystem
S3 (native)        s3n          org.apache.hadoop.fs.s3native.NativeS3FileSystem
S3 (block-based)   s3           org.apache.hadoop.fs.s3.S3FileSystem

There exist two implementations of the S3 filesystem. The native filesystem, s3n, is used to read and write regular files. Data stored using s3n can be accessed by any S3 tool and, conversely, s3n can be used to read data generated by other S3 tools. s3n cannot handle files larger than 5 TB or rename operations.

Much like HDFS, the block-based S3 filesystem stores files in blocks and requires an S3 bucket to be dedicated to the filesystem. Files stored in a block-based S3 filesystem can be larger than 5 TB, but they will not be interoperable with other S3 tools. Additionally, block-based S3 supports rename operations.
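As an illustration, once AWS credentials have been supplied to Hadoop (for example, via the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey configuration properties), an s3n URI can be used wherever a filesystem path is expected; the bucket name below is a placeholder:

$ hadoop fs -ls s3n://learninghadoop2/
$ hadoop distcp hdfs:///user/cloudera/testdir s3n://learninghadoop2/backup/testdir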


Hadoop interfaces
Hadoop is written in Java, and not surprisingly, all interaction with the system happens via the Java API. The command-line interface we used through the hdfs command in previous examples is a Java application that uses the FileSystem class to carry out input/output operations on the available filesystems.

Java FileSystem API
The Java API, provided by the org.apache.hadoop.fs package, exposes Apache Hadoop filesystems.

org.apache.hadoop.fs.FileSystem is the abstract class each filesystem implements; it provides a general interface to interact with data in Hadoop. All code that uses HDFS should be written with the capability of handling a FileSystem object.
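As a flavor of the API, the following minimal sketch (not taken from the book's examples) obtains a FileSystem from the configuration, writes a small file, and reads it back; the path is a placeholder:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the Hadoop configuration, so the same
        // code works against HDFS on a cluster or the local filesystem.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/cloudera/fs-example.txt");

        // Write a small file, overwriting it if it already exists.
        FSDataOutputStream out = fs.create(path, true);
        out.writeBytes("Hello world\n");
        out.close();

        // Read the first line back.
        FSDataInputStream in = fs.open(path);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        System.out.println(reader.readLine());
        reader.close();
    }
}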

Libhdfs
Libhdfs is a C library that, despite its name, can be used to access any Hadoop filesystem and not just HDFS. It is written using the Java Native Interface (JNI) and mimics the Java FileSystem class.

Thrift
Apache Thrift (http://thrift.apache.org) is a framework for building cross-language software through data serialization and remote method invocation mechanisms. The Hadoop Thrift API, available in contrib, exposes Hadoop filesystems as a Thrift service. This interface makes it easy for non-Java code to access data stored in a Hadoop filesystem.

Other than the aforementioned interfaces, there exist other interfaces that allow access to Hadoop filesystems via HTTP and FTP (these for HDFS only), as well as WebDAV.


Managing and serializing data
Having a filesystem is all well and good, but we also need mechanisms to represent data and store it on the filesystems. We will explore some of these mechanisms now.


The Writable interface
It is useful to us as developers if we can manipulate higher-level data types and have Hadoop look after the processes required to serialize them into bytes to write to a filesystem, and reconstruct them from a stream of bytes when they are read from the filesystem.

The org.apache.hadoop.io package contains the Writable interface, which provides this mechanism and is specified as follows:

public interface Writable
{
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

The main purpose of this interface is to provide mechanisms for the serialization and deserialization of data as it is passed across the network or read from and written to disk.

When we explore processing frameworks on Hadoop in later chapters, we will often see instances where the requirement is for a data argument to be of the type Writable. If we use data structures that provide a suitable implementation of this interface, then the Hadoop machinery can automatically manage the serialization and deserialization of the data type without knowing anything about what it represents or how it is used.
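To make this concrete, here is a minimal, hypothetical custom Writable holding a count and a timestamp; it is a sketch rather than anything used by the book's examples, but it shows the required symmetry between write() and readFields():

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class CounterWritable implements Writable {
    private long count;
    private long timestamp;

    // A no-argument constructor is required so Hadoop can instantiate
    // the class reflectively before calling readFields().
    public CounterWritable() { }

    public CounterWritable(long count, long timestamp) {
        this.count = count;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields in a fixed order...
        out.writeLong(count);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // ...and read them back in exactly the same order.
        count = in.readLong();
        timestamp = in.readLong();
    }
}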


Introducing the wrapper classes
Fortunately, you don't have to start from scratch and build Writable variants of all the data types you will use. Hadoop provides classes that wrap the Java primitive types and implement the Writable interface. They are provided in the org.apache.hadoop.io package.

These classes are conceptually similar to the primitive wrapper classes, such as Integer and Long, found in java.lang. They hold a single primitive value that can be set either at construction or via a setter method. They are as follows:

BooleanWritable

ByteWritable

DoubleWritable

FloatWritable

IntWritable

LongWritable

VIntWritable:avariablelengthintegertypeVLongWritable:avariablelengthlongtypeThereisalsoText,whichwrapsjava.lang.String.


Array wrapper classes

Hadoop also provides some collection-based wrapper classes. These classes provide Writable wrappers for arrays of other Writable objects. For example, an instance could either hold an array of IntWritable or DoubleWritable, but not arrays of the raw int or float types. A specific subclass for the required Writable class is needed; a minimal sketch of such a subclass follows the list below. The classes are as follows:

ArrayWritable

TwoDArrayWritable
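
A minimal sketch of the kind of subclass mentioned above (our own illustration, not a Hadoop-provided class) simply fixes the element type in the constructor so that deserialization knows which Writable to instantiate:

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;

public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable() {
        super(IntWritable.class);                // needed for deserialization
    }

    public IntArrayWritable(IntWritable[] values) {
        super(IntWritable.class, values);        // wrap an existing array
    }
}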


The Comparable and WritableComparable interfaces

We were slightly inaccurate when we said that the wrapper classes implement Writable; they actually implement a composite interface called WritableComparable in the org.apache.hadoop.io package that combines Writable with the standard java.lang.Comparable interface:

public interface WritableComparable extends Writable, Comparable
{
}

The need for Comparable will only become apparent when we explore MapReduce in the next chapter, but for now, just remember that the wrapper classes provide mechanisms for them to be both serialized and sorted by Hadoop or any of its frameworks.


Storing data

Until now, we introduced the architecture of HDFS and how to programmatically store and retrieve data using the command-line tools and the Java API. In the examples seen so far, we have implicitly assumed that our data was stored as a text file. In reality, some applications and datasets will require ad hoc data structures to hold the file's contents. Over the years, file formats have been created to address both the requirements of MapReduce processing (for instance, we want data to be splittable) and to satisfy the need to model both structured and unstructured data. Currently, a lot of focus is dedicated to better capturing the use cases of relational data storage and modeling. In the remainder of this chapter, we will introduce some of the popular file format choices available within the Hadoop ecosystem.


Serialization and Containers

When talking about file formats, we are assuming two types of scenarios, which are as follows:

Serialization: we want to encode data structures generated and manipulated at processing time to a format we can store to a file, transmit, and, at a later stage, retrieve and translate back for further manipulation
Containers: once data is serialized to files, containers provide means to group multiple files together and add additional metadata


Compression

When working with data, file compression can often lead to significant savings, both in terms of the space necessary to store files as well as on the data I/O across the network and from/to local disks.

In broad terms, when using a processing framework, compression can occur at three points in the processing pipeline:

Input files to be processed
Output files that result after processing is completed
Intermediate/temporary files produced internally within the pipeline

When we add compression at any of these stages, we have an opportunity to dramatically reduce the amount of data to be read or written to the disk or across the network. This is particularly useful with frameworks such as MapReduce that can, for example, produce volumes of temporary data that are larger than either the input or output datasets.

Apache Hadoop comes with a number of compression codecs: gzip, bzip2, LZO, snappy, each with its own tradeoffs. Picking a codec is an educated choice that should consider both the kind of data being processed as well as the nature of the processing framework itself.

Other than the general space/time tradeoff, where the largest space savings come at the expense of compression and decompression speed (and vice versa), we need to take into account that data stored in HDFS will be accessed by parallel, distributed software; some of this software will also add its own particular requirements on file formats. MapReduce, for example, is most efficient on files that can be split into valid subfiles.

This can complicate decisions, such as the choice of whether to compress and, if so, which codec to use, as most compression codecs (such as gzip) do not support splittable files, whereas a few (such as LZO) do.
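
As a sketch of how these choices surface in code (assuming a Job object such as the ones used in the drivers shown in the next chapter; the property names are the Hadoop 2 ones), output and intermediate compression can be configured separately:

import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionSettings {
    // Call from a driver before submitting the job.
    static void configureCompression(Job job) {
        // Final output: gzip-compressed part files (not splittable).
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        // Intermediate map output: read only by the framework, so
        // splittability does not matter and a fast codec is a common choice.
        job.getConfiguration().setBoolean(
            "mapreduce.map.output.compress", true);
        job.getConfiguration().setClass(
            "mapreduce.map.output.compress.codec",
            SnappyCodec.class, CompressionCodec.class);
    }
}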


General-purpose file formats

The first class of file formats are those general-purpose ones that can be applied to any application domain and make no assumptions on data structure or access patterns.

Text: the simplest approach to storing data on HDFS is to use flat files. Text files can be used both to hold unstructured data (a web page or a tweet) as well as structured data (a CSV file that is a few million rows long). Text files are splittable, though one needs to consider how to handle boundaries between multiple elements (for example, lines) in the file.
SequenceFile: a SequenceFile is a flat data structure consisting of binary key/value pairs, introduced to address specific requirements of MapReduce-based processing. It is still extensively used in MapReduce as an input/output format. As we will see in Chapter 3, Processing – MapReduce and Beyond, internally, the temporary outputs of maps are stored using SequenceFile.

SequenceFile provides Writer, Reader, and Sorter classes to write, read, and sort data, respectively.

Depending on the compression mechanism in use, three variations of SequenceFile can be distinguished:

Uncompressed key/value records.
Record compressed key/value records. Only 'values' are compressed.
Block compressed key/value records. Keys and values are collected in blocks of arbitrary size and compressed separately.

In each case, however, the SequenceFile remains splittable, which is one of its biggest strengths.
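
As an illustration (a minimal sketch using the Hadoop 2 options-style writer API, not code from the book's repository), a block-compressed SequenceFile of (Text, IntWritable) pairs could be written as follows:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]); // for example, counts.seq

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(path),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(IntWritable.class),
            SequenceFile.Writer.compression(
                SequenceFile.CompressionType.BLOCK));
        try {
            // Each append writes one key/value record.
            writer.append(new Text("#hadoop2"), new IntWritable(12));
            writer.append(new Text("#bigdata"), new IntWritable(7));
        } finally {
            writer.close();
        }
    }
}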


Column-oriented data formats

In the relational database world, column-oriented data stores organize and store tables based on the columns; generally speaking, the data for each column will be stored together. This is a significantly different approach compared to most relational DBMSs that organize data per row. Column-oriented storage has significant performance advantages; for example, if a query needs to read only two columns from a very wide table containing hundreds of columns, then only the required column data files are accessed. A traditional row-oriented database would have to read all columns for each row for which data was required. This has the greatest impact on workloads where aggregate functions are computed over large numbers of similar items, such as with the OLAP workloads typical of data warehouse systems.

In Chapter 7, Hadoop and SQL, we will see how Hadoop is becoming a SQL backend for the data warehouse world thanks to projects such as Apache Hive and Cloudera Impala. As part of the expansion into this domain, a number of file formats have been developed to account for both relational modeling and data warehousing needs.

RCFile, ORC, and Parquet are three state-of-the-art column-oriented file formats developed with these use cases in mind.

RCFile

Record Columnar File (RCFile) was originally developed by Facebook to be used as the backend storage for their Hive data warehouse system, which was the first mainstream SQL-on-Hadoop system available as open source.

RCFile aims to provide the following:

Fast data loading
Fast query processing
Efficient storage utilization
Adaptability to dynamic workloads

More information on RCFile can be found at http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/abs11-4.html.

ORC

The Optimized Row Columnar file format (ORC) aims to combine the performance of the RCFile with the flexibility of Avro. It is primarily intended to work with Apache Hive and was initially developed by Hortonworks to overcome the perceived limitations of other available file formats.

More details can be found at http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html.

Parquet

Parquet, found at http://parquet.incubator.apache.org, was originally a joint effort of Cloudera, Twitter, and Criteo, and has now been donated to the Apache Software Foundation. The goals of Parquet are to provide a modern, performant, columnar file format to be used with Cloudera Impala. As with Impala, Parquet has been inspired by the Dremel paper (http://research.google.com/pubs/pub36632.html). It allows complex, nested data structures and allows efficient encoding on a per-column level.

Avro

Apache Avro (http://avro.apache.org) is a schema-oriented binary data serialization format and file container. Avro will be our preferred binary data format throughout this book. It is both splittable and compressible, making it an efficient format for data processing with frameworks such as MapReduce.

Numerous other projects also have built-in specific Avro support and integration, so it is very widely applicable. When data is stored in an Avro file, its schema, defined as a JSON object, is stored with it. A file can be later processed by a third party with no a priori notion of how the data is encoded. This makes data self-describing and facilitates use with dynamic and scripting languages. The schema-on-read model also helps make Avro records efficient to store, as there is no need for the individual fields to be tagged.

In later chapters, you will see how these properties can make data lifecycle management easier and allow non-trivial operations such as schema migrations.

Using the Java API

We'll now demonstrate the use of the Java API to parse Avro schemas, read and write Avro files, and use Avro's code generation facilities. Note that the format is intrinsically language independent; there are APIs for most languages, and files created by Java will seamlessly be read from any other language.

Avro schemas are described as JSON documents and represented by the org.apache.avro.Schema class. To demonstrate the API for manipulating Avro documents, we'll look ahead to an Avro specification we use for a Hive table in Chapter 7, Hadoop and SQL. The following code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch2/src/main/java/com/learninghadoop2/avro/AvroParse.java.
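
For orientation before looking at the code, a schema of this kind is just a JSON record definition. The following is a minimal, hypothetical sketch containing only the text field used below; the actual tweets_avro.avsc in the book's repository defines several more fields:

{
  "namespace": "com.learninghadoop2.avrotables",
  "type": "record",
  "name": "tweets_avro",
  "fields": [
    {"name": "text", "type": "string"}
  ]
}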

In the following code, we will use the Avro Java API to create an Avro file containing a tweet record and then re-read the file, using the schema in the file to extract the details of the stored records:

public static void testGenericRecord() {
    try {
        Schema schema = new Schema.Parser()
            .parse(new File("tweets_avro.avsc"));
        GenericRecord tweet = new GenericData.Record(schema);
        tweet.put("text", "The generic tweet text");

        File file = new File("tweets.avro");
        DatumWriter<GenericRecord> datumWriter =
            new GenericDatumWriter<>(schema);
        DataFileWriter<GenericRecord> fileWriter =
            new DataFileWriter<>(datumWriter);
        fileWriter.create(schema, file);
        fileWriter.append(tweet);
        fileWriter.close();

        DatumReader<GenericRecord> datumReader =
            new GenericDatumReader<>(schema);
        DataFileReader<GenericRecord> fileReader =
            new DataFileReader<>(file, datumReader);
        GenericRecord genericTweet = null;
        while (fileReader.hasNext()) {
            genericTweet = (GenericRecord) fileReader.next(genericTweet);
            for (Schema.Field field :
                    genericTweet.getSchema().getFields()) {
                Object val = genericTweet.get(field.name());
                if (val != null) {
                    System.out.println(val);
                }
            }
        }
    } catch (IOException ie) {
        System.out.println("Error parsing or writing file.");
    }
}

The tweets_avro.avsc schema, found at https://github.com/learninghadoop2/book-examples/blob/master/ch2/tweets_avro.avsc, describes a tweet with multiple fields. To create an Avro object of this type, we first parse the schema file. We then use Avro's concept of a GenericRecord to build an Avro document that complies with this schema. In this case, we only set a single attribute: the tweet text itself.

To write this Avro file, containing a single object, we then use Avro's I/O capabilities. To read the file, we do not need to start with the schema, as we can extract it from the GenericRecord we read from the file. We then walk through the schema structure and dynamically process the document based on the discovered fields. This is particularly powerful, as it is the key enabler of clients remaining independent of the Avro schema and how it evolves over time.

If we have the schema file in advance, however, we can use Avro code generation to create a customized class that makes manipulating Avro records much easier. To generate the code, we will use the compile class in the avro-tools.jar, passing it the name of the schema file and the desired output directory:

$ java -jar /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/avro/avro-tools.jar compile schema tweets_avro.avsc src/main/java

The class will be placed in a directory structure based on any namespace defined in the schema. Since we created this schema in the com.learninghadoop2.avrotables namespace, we see the following:

$ ls src/main/java/com/learninghadoop2/avrotables/tweets_avro.java

With this class, let's revisit the creation and the act of reading and writing Avro objects, as follows:

public static void testGeneratedCode() {
    tweets_avro tweet = new tweets_avro();
    tweet.setText("The code generated tweet text");
    try {
        File file = new File("tweets.avro");
        DatumWriter<tweets_avro> datumWriter =
            new SpecificDatumWriter<>(tweets_avro.class);
        DataFileWriter<tweets_avro> fileWriter =
            new DataFileWriter<>(datumWriter);
        fileWriter.create(tweet.getSchema(), file);
        fileWriter.append(tweet);
        fileWriter.close();

        DatumReader<tweets_avro> datumReader =
            new SpecificDatumReader<>(tweets_avro.class);
        DataFileReader<tweets_avro> fileReader =
            new DataFileReader<>(file, datumReader);
        while (fileReader.hasNext()) {
            tweet = fileReader.next(tweet);
            System.out.println(tweet.getText());
        }
    } catch (IOException ie) {
        System.out.println("Error in parsing or writing files.");
    }
}

Because we used code generation, we now use the Avro SpecificRecord mechanism alongside the generated class that represents the object in our domain model. Consequently, we can directly instantiate the object and access its attributes through familiar get/set methods.

Writing the file is similar to the action performed before, except that we use specific classes and also retrieve the schema directly from the tweet object when needed. Reading is similarly eased through the ability to create instances of a specific class and use get/set methods.


Summary

This chapter has given a whistle-stop tour through storage on a Hadoop cluster. In particular, we covered:

The high-level architecture of HDFS, the main filesystem used in Hadoop
How HDFS works under the covers and, in particular, its approach to reliability
How Hadoop 2 has added significantly to HDFS, particularly in the form of NameNode HA and filesystem snapshots
What ZooKeeper is and how it is used by Hadoop to enable features such as NameNode automatic failover
An overview of the command-line tools used to access HDFS
The API for filesystems in Hadoop and how, at a code level, HDFS is just one implementation of a more flexible filesystem abstraction
How data can be serialized onto a Hadoop filesystem and some of the support provided in the core classes
The various file formats available in which data is most frequently stored in Hadoop and some of their particular use cases

In the next chapter, we will look in detail at how Hadoop provides processing frameworks that can be used to process the data stored within it.


Chapter 3. Processing – MapReduce and Beyond

In Hadoop 1, the platform had two clear components: HDFS for data storage and MapReduce for data processing. The previous chapter described the evolution of HDFS in Hadoop 2, and in this chapter we'll discuss data processing.

The picture with processing in Hadoop 2 has changed more significantly than has storage, and Hadoop now supports multiple processing models as first-class citizens. In this chapter we'll explore both MapReduce and other computational models in Hadoop 2. In particular, we'll cover:

What MapReduce is and the Java API required to write applications for it
How MapReduce is implemented in practice
How Hadoop reads data into and out of its processing jobs
YARN, the Hadoop 2 component that allows processing beyond MapReduce on the platform
An introduction to several computational models implemented on YARN


MapReduce

MapReduce is the primary processing model supported in Hadoop 1. It follows a divide-and-conquer model for processing data made popular by a 2004 paper by Google (http://research.google.com/archive/mapreduce.html) and has foundations both in functional programming and database research. The name itself refers to two distinct steps applied to all input data: a map function and a reduce function.

Every MapReduce application is a sequence of jobs that build atop this very simple model. Sometimes, the overall application may require multiple jobs, where the output of the reduce stage from one is the input to the map stage of another, and sometimes there might be multiple map or reduce functions, but the core concepts remain the same.

We will introduce the MapReduce model by looking at the nature of the map and reduce functions and then describe the Java API required to build implementations of the functions. After showing some examples, we will walk through a MapReduce execution to give more insight into how the actual MapReduce framework executes code at runtime.

Learning the MapReduce model can be a little counter-intuitive; it's often difficult to appreciate how very simple functions can, when combined, provide very rich processing on enormous datasets. But it does work, trust us!

As we explore the nature of the map and reduce functions, think of them as being applied to a stream of records being retrieved from the source dataset. We'll describe how that happens later; for now, think of the source data being sliced into smaller chunks, each of which gets fed to a dedicated instance of the map function. Each record has the map function applied, producing a set of intermediary data. Records are retrieved from this temporary dataset and all associated records are fed together through the reduce function. The final output of the reduce function for all the sets of records is the overall result for the complete job.

From a functional perspective, MapReduce transforms data structures from one list of (key, value) pairs into another. During the Map phase, data is loaded from HDFS, and a function is applied in parallel to every input (key, value) pair, and a new list of (key, value) pairs is the output:

map(k1, v1) -> list(k2, v2)

The framework then collects all pairs with the same key from all lists and groups them together, creating one group for each key. A Reduce function is applied in parallel to each group, which in turn produces a list of values:

reduce(k2, list(v2)) -> (k3, list(v3))

The output is then written back to HDFS in the following manner:


Map and Reduce phases


Java API to MapReduce

The Java API to MapReduce is exposed by the org.apache.hadoop.mapreduce package. Writing a MapReduce program, at its core, is a matter of subclassing Hadoop-provided Mapper and Reducer base classes, and overriding the map() and reduce() methods with our own implementation.


The Mapper class

For our own Mapper implementations, we will subclass the Mapper base class and override the map() method, as follows:

class Mapper<K1, V1, K2, V2>
{
    void map(K1 key, V1 value, Mapper.Context context)
        throws IOException, InterruptedException
    ...
}

The class is defined in terms of the key/value input and output types, and then the map method takes an input key/value pair as its parameter. The other parameter is an instance of the Context class that provides various mechanisms to communicate with the Hadoop framework, one of which is to output the results of a map or reduce method.

Notice that the map method only refers to a single instance of K1 and V1 key/value pairs. This is a critical aspect of the MapReduce paradigm in which you write classes that process single records, and the framework is responsible for all the work required to turn an enormous dataset into a stream of key/value pairs. You will never have to write map or reduce classes that try to deal with the full dataset. Hadoop also provides mechanisms through its InputFormat and OutputFormat classes that provide implementations of common file formats and likewise remove the need for having to write file parsers for any but custom file types.

There are three additional methods that may sometimes need to be overridden:

protected void setup(Mapper.Context context)
    throws IOException, InterruptedException

This method is called once before any key/value pairs are presented to the map method. The default implementation does nothing.

protected void cleanup(Mapper.Context context)
    throws IOException, InterruptedException

This method is called once after all key/value pairs have been presented to the map method. The default implementation does nothing.

protected void run(Mapper.Context context)
    throws IOException, InterruptedException

This method controls the overall flow of task processing within a JVM. The default implementation calls the setup method once before repeatedly calling the map method for each key/value pair in the split, and then finally calls the cleanup method.
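
As a sketch of when overriding these methods is useful (a hypothetical mapper, not part of the book's examples; the property name job.min.word.length is made up for illustration), setup() is the natural place to read job configuration once per task before map() starts being called:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfiguredMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private int minLength;

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // Read a job-wide setting once, before any records are processed.
        minLength = context.getConfiguration().getInt("job.min.word.length", 1);
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split(" ")) {
            if (word.length() >= minLength) {
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }
}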


The Reducer class

The Reducer base class works very similarly to the Mapper class and usually requires only subclasses to override a single reduce() method. Here is the cut-down class definition:

public class Reducer<K2, V2, K3, V3>
{
    void reduce(K2 key, Iterable<V2> values,
        Reducer.Context context)
        throws IOException, InterruptedException
    ...
}

Again, notice the class definition in terms of the broader data flow (the reduce method accepts K2/V2 as input and provides K3/V3 as output), while the actual reduce method takes only a single key and its associated list of values. The Context object is again the mechanism to output the result of the method.

This class also has the setup, run, and cleanup methods with similar default implementations as with the Mapper class that can optionally be overridden:

protected void setup(Reducer.Context context)
    throws IOException, InterruptedException

The setup() method is called once before any keys and their lists of values are presented to the reduce method. The default implementation does nothing.

protected void cleanup(Reducer.Context context)
    throws IOException, InterruptedException

The cleanup() method is called once after all keys and their lists of values have been presented to the reduce method. The default implementation does nothing.

protected void run(Reducer.Context context)
    throws IOException, InterruptedException

The run() method controls the overall flow of processing the task within the JVM. The default implementation calls the setup method before repeatedly (and potentially concurrently) calling the reduce method for as many key/value pairs as are provided to the Reducer class, and then finally calls the cleanup method.


The Driver class

The Driver class communicates with the Hadoop framework and specifies the configuration elements needed to run a MapReduce job. This involves aspects such as telling Hadoop which Mapper and Reducer classes to use, where to find the input data and in what format, and where to place the output data and how to format it.

The driver logic usually exists in the main method of the class written to encapsulate a MapReduce job. There is no default parent Driver class to subclass:

public class ExampleDriver extends Configured implements Tool
{
    ...
    public int run(String[] args) throws Exception
    {
        // Create a Configuration object that is used to set other options
        Configuration conf = getConf();
        // Get command line arguments
        args = new GenericOptionsParser(conf, args)
            .getRemainingArgs();
        // Create the object representing the job
        Job job = new Job(conf, "ExampleJob");
        // Set the name of the main class in the job jar file
        job.setJarByClass(ExampleDriver.class);
        // Set the mapper class
        job.setMapperClass(ExampleMapper.class);
        // Set the reducer class
        job.setReducerClass(ExampleReducer.class);
        // Set the types for the final output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Set input and output file paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Execute the job and wait for it to complete
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception
    {
        int exitCode = ToolRunner.run(new ExampleDriver(), args);
        System.exit(exitCode);
    }
}

In the preceding lines of code, org.apache.hadoop.util.Tool is an interface for handling command-line options. The actual handling is delegated to ToolRunner.run, which runs Tool with the given Configuration, used to get and set a job's configuration options. By subclassing org.apache.hadoop.conf.Configured, we can set the Configuration object directly from command-line options via GenericOptionsParser.
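
In practice, this means generic options can be supplied on the command line without any extra parsing code in the driver; for example (a sketch with placeholder paths), the number of reducers can be overridden through the standard mapreduce.job.reduces property:

$ hadoop jar <job jar file> ExampleDriver \
    -D mapreduce.job.reduces=4 \
    <input path> <output path>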

Given our previous talk of jobs, it's not surprising that much of the setup involves operations on a Job object. This includes setting the job name and specifying which classes are to be used for the mapper and reducer implementations.

Certain input/output configurations are set and, finally, the arguments passed to the main method are used to specify the input and output locations for the job. This is a very common model that you will see often.

There are a number of default values for configuration options, and we are implicitly using some of them in the preceding class. Most notably, we don't say anything about the format of the input files or how the output files are to be written. These are defined through the InputFormat and OutputFormat classes mentioned earlier; we will explore them in detail later. The default input and output formats are text files that suit our examples. There are multiple ways of expressing the format within text files, in addition to particularly optimized binary formats.

A common model for less complex MapReduce jobs is to have the Mapper and Reducer classes as inner classes within the driver. This allows everything to be kept in a single file, which simplifies the code distribution.


Combiner

Hadoop allows the use of a combiner class to perform some early aggregation of the output from the map method before it's retrieved by the reducer.

Much of Hadoop's design is predicated on reducing the expensive parts of a job that usually equate to disk and network I/O. The output of the mapper is often large; it's not infrequent to see it many times the size of the original input. Hadoop does allow configuration options to help reduce the impact of the reducers transferring such large chunks of data across the network. The combiner takes a different approach, where it's possible to perform early aggregation to require less data to be transferred in the first place.

The combiner does not have its own interface; a combiner must have the same signature as the reducer, and hence also subclasses the Reducer class from the org.apache.hadoop.mapreduce package. The effect of this is to basically perform a mini-reduce on the mapper for the output destined for each reducer.

Hadoop does not guarantee whether the combiner will be executed. Sometimes, it may not be executed at all, while at other times it may be used once, twice, or more times depending on the size and number of output files generated by the mapper for each reducer.
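
Wiring a combiner in is a single line in the driver. As a sketch (assuming a Job object named job, as in the drivers above), the HashTagCount job later in this chapter does exactly this with IntSumReducer, which is safe because addition is associative and commutative; a reducer that merely counts how many values it receives would give wrong results if reused as a combiner:

// Safe: summing partial sums gives the same result as summing the raw 1s.
job.setCombinerClass(
    org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer.class);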


Partitioning

One of the implicit guarantees of the Reduce interface is that a single reducer will be given all the values associated with a given key. With multiple reduce tasks running across a cluster, each mapper output must therefore be partitioned into the separate outputs destined for each reducer. These partitioned files are stored on the local node filesystem.

The number of reduce tasks across the cluster is not as dynamic as that of mappers, and indeed we can specify the value as part of our job submission. Hadoop therefore knows how many reducers will be needed to complete the job and, from this, it knows into how many partitions the mapper output should be split.

The optional partition function

Within the org.apache.hadoop.mapreduce package is the Partitioner class, an abstract class with the following signature:

public abstract class Partitioner<Key, Value>
{
    public abstract int getPartition(Key key, Value value,
        int numPartitions);
}

By default, Hadoop will use a strategy that hashes the output key to perform the partitioning. This functionality is provided by the HashPartitioner class within the org.apache.hadoop.mapreduce.lib.partition package, but it's necessary in some cases to provide a custom subclass of Partitioner with application-specific partitioning logic. Notice that the getPartition function takes the key, value, and number of partitions as parameters, any of which can be used by the custom partitioning logic.

A custom partitioning strategy would be particularly necessary if, for example, the data provided a very uneven distribution when the standard hash function was applied. Uneven partitioning can result in some tasks having to perform significantly more work than others, leading to much longer overall job execution time.
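
A hypothetical sketch of such a subclass (ours, not from the book's examples) might route keys by their first character; a real implementation would be driven by the observed key skew:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstCharacterPartitioner
        extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        // Mask off the sign bit so the result is always non-negative.
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered in the driver with job.setPartitionerClass(FirstCharacterPartitioner.class).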


Hadoop-provided mapper and reducer implementations

We don't always have to write our own Mapper and Reducer classes from scratch. Hadoop provides several common Mapper and Reducer implementations that can be used in our jobs. If we don't override any of the methods in the Mapper and Reducer classes, the default implementations are the identity Mapper and Reducer classes, which simply output the input unchanged.

The mappers are found in the org.apache.hadoop.mapreduce.lib.map package and include the following:

InverseMapper: returns (value, key) as an output, that is, the input key is output as the value and the input value is output as the key
TokenCounterMapper: counts the number of discrete tokens in each line of input
IdentityMapper: implements the identity function, mapping inputs directly to outputs

The reducers are found in the org.apache.hadoop.mapreduce.lib.reduce package and currently include the following:

IntSumReducer: outputs the sum of the list of integer values per key
LongSumReducer: outputs the sum of the list of long values per key
IdentityReducer: implements the identity function, mapping inputs directly to outputs


Sharing reference data

Occasionally, we might want to share data across tasks. For instance, if we need to perform a lookup operation on an ID-to-string translation table, we might want such a data source to be accessible by the mapper or reducer. A straightforward approach is to store the data we want to access on HDFS and use the FileSystem API to query it as part of the Map or Reduce steps.

Hadoop gives us an alternative mechanism to achieve the goal of sharing reference data across all tasks in the job: the DistributedCache, defined by the org.apache.hadoop.mapreduce.filecache.DistributedCache class. This can be used to efficiently make available common read-only files that are used by the map or reduce tasks to all nodes.

The files can be text data as in this case, but could also be additional JARs, binary data, or archives; anything is possible. The files to be distributed are placed on HDFS and added to the DistributedCache within the job driver. Hadoop copies the files onto the local filesystem of each node prior to job execution, meaning every task has local access to the files.

An alternative is to bundle needed files into the job JAR submitted to Hadoop. This does tie the data to the job JAR, making it more difficult to share across jobs, and requires the JAR to be rebuilt if the data changes.
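
In Hadoop 2, the usual entry points are methods on Job and the task Context rather than the DistributedCache class itself. A sketch (the HDFS path is a placeholder):

// In the driver: register an HDFS file with the job. The "#positive-words"
// fragment asks Hadoop to create a symlink with that name in the task's
// working directory.
job.addCacheFile(new java.net.URI("/data/positive-words.txt#positive-words"));

// In a mapper's or reducer's setup() method: list the cached files, or open
// the symlink directly as a local file named "positive-words".
java.net.URI[] cached = context.getCacheFiles();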


Writing MapReduce programs

In this chapter, we will be focusing on batch workloads; given a set of historical data, we will look at properties of that dataset. In Chapter 4, Real-time Computation with Samza, and Chapter 5, Iterative Computation with Spark, we will show how a similar type of analysis can be performed over a stream of text collected in real time.


Getting started

In the following examples, we will assume a dataset generated by collecting 1,000 tweets using the stream.py script, as shown in Chapter 1, Introduction:

$ python stream.py -t -n 1000 > tweets.txt

We can then copy the dataset into HDFS with:

$ hdfs dfs -put tweets.txt <destination>

Tip
Note that until now we have been working only with the text of tweets. In the remainder of this book, we'll extend stream.py to output additional tweet metadata in JSON format. Keep this in mind before dumping terabytes of messages with stream.py.

Our first MapReduce program will be the canonical WordCount example. A variation of this program will be used to determine trending topics. We will then analyze text associated with topics to determine whether it expresses a "positive" or "negative" sentiment. Finally, we will make use of a MapReduce pattern, ChainMapper, to pull things together and present a data pipeline to clean and prepare the textual data we'll feed to the trending topic and sentiment analysis model.


Running the examples

The full source code of the examples described in this section can be found at https://github.com/learninghadoop2/book-examples/tree/master/ch3.

Before we run our job in Hadoop, we must compile our code and collect the required class files into a single JAR file that we will submit to the system. Using Gradle, you can build the needed JAR file with:

$ ./gradlew jar

Local cluster

Jobs are executed on Hadoop using the jar option of the Hadoop command-line utility. To use this, we specify the name of the JAR file, the main class within it, and any arguments that will be passed to the main class, as shown in the following command:

$ hadoop jar <job jar file> <main class> <argument1> ... <argumentN>

Elastic MapReduce

Recall from Chapter 1, Introduction, that Elastic MapReduce expects the job JAR file and its input data to be located in an S3 bucket and conversely will dump its output back into S3.

Note
Be careful: this will cost money! For this example, we will use the smallest possible cluster configuration available for EMR, a single-node cluster.

First of all, we will copy the tweet dataset and the list of positive and negative words to S3 using the aws command-line utility:

$ aws s3 cp tweets.txt s3://<bucket>/input
$ aws s3 cp job.jar s3://<bucket>

We can execute a job using the EMR command-line tool as follows, by uploading the JAR file to s3://<bucket> and adding a CUSTOM_JAR step with the aws CLI:

$ aws emr add-steps --cluster-id <cluster-id> --steps \
Type=CUSTOM_JAR,\
Name=CustomJAR,\
Jar=s3://<bucket>/job.jar,\
MainClass=<classname>,\
Args=arg1,arg2,...argN

Here, cluster-id is the ID of a running EMR cluster, <classname> is the fully qualified name of the main class, and arg1, arg2, ..., argN are the job arguments.


WordCount, the Hello World of MapReduce

WordCount counts word occurrences in a dataset. The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/WordCount.java. Consider the following block of code, for example:

public class WordCount extends Configured implements Tool
{
    public static class WordCountMapper
        extends Mapper<Object, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
            ) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            for (String str : words)
            {
                word.set(str);
                context.write(word, one);
            }
        }
    }

    public static class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context
            ) throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable val : values) {
                total++;
            }
            context.write(key, new IntWritable(total));
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args)
            .getRemainingArgs();
        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCount(), args);
        System.exit(exitCode);
    }
}

This is our first complete MapReduce job. Look at the structure, and you should recognize the elements we have previously discussed: the overall Job class with the driver configuration in its main method and the Mapper and Reducer implementations defined as static nested classes.

We'll do a more detailed walkthrough of the mechanics of MapReduce in the next section, but for now, let's look at the preceding code and think of how it realizes the key/value transformations we discussed earlier.

The input to the Mapper class is arguably the hardest to understand, as the key is not actually used. The job specifies TextInputFormat as the format of the input data and, by default, this delivers to the mapper data where the key is the byte offset in the file and the value is the text of that line. In reality, you may never actually see a mapper that uses that byte offset key, but it's provided.

The mapper is executed once for each line of text in the input source, and every time it takes the line and breaks it into words. It then uses the Context object to output (more commonly known as emitting) each new key/value of the form (word, 1). These are our K2/V2 values.

We said before that the input to the reducer is a key and a corresponding list of values, and there is some magic that happens between the map and reduce methods to collect the values for each key that facilitates this (called the shuffle stage), which we won't describe right now. Hadoop executes the reducer once for each key, and the preceding reducer implementation simply counts the numbers in the Iterable object and gives output for each word in the form of (word, count). These are our K3/V3 values.

Take a look at the signatures of our mapper and reducer classes: the WordCountMapper class accepts Object and Text as input and provides Text and IntWritable as output. The WordCountReducer class has Text and IntWritable accepted as both input and output. This is again quite a common pattern, where the map method performs an inversion on the key and values, and instead emits a series of data pairs on which the reducer performs aggregation.

The driver is more meaningful here, as we have real values for the parameters. We use arguments passed to the class to specify the input and output locations.

Run the job with:

$ hadoop jar build/libs/mapreduce-example.jar \
    com.learninghadoop2.mapreduce.WordCount \
    twitter.txt output


Examine the output with a command such as the following; the actual filename might be different, so just look inside the directory called output in your home directory on HDFS:

$ hdfs dfs -cat output/part-r-00000


Word co-occurrences

Words occurring together are likely to be phrases, and common (frequently occurring) phrases are likely to be important. In Natural Language Processing, a list of co-occurring terms is called an N-Gram. N-Grams are the foundation of several statistical methods for text analytics. We will give an example of the special case of an N-Gram composed of two terms (a bigram), a metric often encountered in analytics applications.

A naïve implementation in MapReduce would be an extension of WordCount that emits a multi-field key composed of two tab-separated words:

public class BiGramCount extends Configured implements Tool
{
    public static class BiGramMapper
        extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
            ) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            Text bigram = new Text();
            String prev = null;
            for (String s : words) {
                if (prev != null) {
                    bigram.set(prev + "\t+\t" + s);
                    context.write(bigram, one);
                }
                prev = s;
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();
        Job job = Job.getInstance(conf);
        job.setJarByClass(BiGramCount.class);
        job.setMapperClass(BiGramMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new BiGramCount(), args);
        System.exit(exitCode);
    }
}

In this job, we replace WordCountReducer with org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer, which implements the same logic. The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/BiGramCount.java.


Trending topics

The # symbol, called a hashtag, is used to mark keywords or topics in a tweet. It was created organically by Twitter users as a way to categorize messages. Twitter Search (found at https://twitter.com/search-home) popularized the use of hashtags as a method to connect and find content related to specific topics as well as the people talking about such topics. By counting the frequency with which a hashtag is mentioned over a given time period, we can determine which topics are trending in the social network:

public class HashTagCount extends Configured implements Tool
{
    public static class HashTagCountMapper
        extends Mapper<Object, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        private String hashtagRegExp =
            "(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)";

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            for (String str : words)
            {
                if (str.matches(hashtagRegExp)) {
                    word.set(str);
                    context.write(word, one);
                }
            }
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args)
            .getRemainingArgs();
        Job job = Job.getInstance(conf);
        job.setJarByClass(HashTagCount.class);
        job.setMapperClass(HashTagCountMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new HashTagCount(), args);
        System.exit(exitCode);
    }
}

As in the WordCount example, we tokenize text in the Mapper. We use a regular expression, hashtagRegExp, to detect the presence of a hashtag in the tweet's text and emit the hashtag and the number 1 when a hashtag is found. In the Reducer step, we then count the total number of emitted hashtag occurrences using IntSumReducer.

The full source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/HashTagCount.java.

This compiled class will be in the JAR file we built with Gradle earlier, so now we execute HashTagCount with the following command:

$ hadoop jar build/libs/mapreduce-example.jar \
    com.learninghadoop2.mapreduce.HashTagCount twitter.txt output

Let’sexaminetheoutputasbefore:

$hdfsdfs-catoutput/part-r-00000

Youshouldseeoutputsimilartothefollowing:

#whey 1
#willpower 1
#win 2
#winterblues 1
#winterstorm 1
#wipolitics 1
#women 6
#woodgrain 1

Each line is composed of a hashtag and the number of times it appears in the tweets dataset. As you can see, the MapReduce job orders results by key. If we want to find the most mentioned topics, we need to order the result set. The naïve approach would be to perform a total ordering of the aggregated values and select the top 10.

If the output dataset is small, we can pipe it to standard output and sort it using the sort utility:

$ hdfs dfs -cat output/part-r-00000 | sort -k2 -n -r | head -n 10

Another solution would be to write another MapReduce job to traverse the whole result set and sort by value. When data becomes large, this type of global sorting can become quite expensive. In the following section, we will illustrate an efficient design pattern to sort aggregated data.

The Top N pattern


In the Top N pattern, we keep data sorted in a local data structure. Each mapper calculates a list of the top N records in its split and sends its list to the reducer. A single reducer task finds the top N global records.

We will apply this design pattern to implement a TopTenHashTag job that finds the top ten topics in our dataset. The job takes as input the output data generated by HashTagCount and returns a list of the ten most frequently mentioned hashtags.

In TopTenMapper, we use a TreeMap to keep a sorted list (in ascending order) of hashtags. The key of this map is the number of occurrences; the value is a tab-separated string of hashtags and their frequency. In map(), for each value, we update the topN map. When topN has more than ten items, we remove the smallest:

public static class TopTenMapper extends Mapper<Object, Text,
    NullWritable, Text> {
    private TreeMap<Integer, Text> topN = new TreeMap<Integer, Text>();
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws
        IOException, InterruptedException {
        String[] words = value.toString().split("\t");
        if (words.length < 2) {
            return;
        }
        topN.put(Integer.parseInt(words[1]), new Text(value));
        if (topN.size() > 10) {
            topN.remove(topN.firstKey());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException,
        InterruptedException {
        for (Text t : topN.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}

Wedon’temitanykey/valueinthemapfunction.Weimplementacleanup()methodthat,oncethemapperhasconsumedallitsinput,emitsthe(hashtag,count)valuesintopN.WeuseaNullWritablekeybecausewewantallvaluestobeassociatedwiththesamekeysothatwecanperformaglobalorderoverallmappers’topnlists.Thisimpliesthatourjobwillexecuteonlyonereducer.

Thereducerimplementslogicsimilartowhatwehaveinmap().WeinstantiateTreeMapanduseittokeepanorderedlistofthetop10values:

public static class TopTenReducer extends
    Reducer<NullWritable, Text, NullWritable, Text> {
    private TreeMap<Integer, Text> topN = new TreeMap<Integer, Text>();

    @Override
    public void reduce(NullWritable key, Iterable<Text> values, Context
        context) throws IOException, InterruptedException {
        for (Text value : values) {
            String[] words = value.toString().split("\t");
            topN.put(Integer.parseInt(words[1]),
                new Text(value));
            if (topN.size() > 10) {
                topN.remove(topN.firstKey());
            }
        }
        for (Text word : topN.descendingMap().values()) {
            context.write(NullWritable.get(), word);
        }
    }
}

Finally, we traverse topN in descending order to generate the list of trending topics.

Note
Note that in this implementation, we override hashtags that have a frequency value already present in the TreeMap when calling topN.put(). Depending on the use case, it's advised to use a different data structure, such as the ones offered by the Guava library (https://code.google.com/p/guava-libraries/), or to adjust the updating strategy.

In the driver, we enforce a single reducer by setting job.setNumReduceTasks(1). Run the job with:

$ hadoop jar build/libs/mapreduce-example.jar \
    com.learninghadoop2.mapreduce.TopTenHashTag \
    output/part-r-00000 \
    top-ten

We can inspect the top ten to list trending topics:

$ hdfs dfs -cat top-ten/part-r-00000

#Stalker48 150
#gameinsight 55
#12M 52
#KCA 46
#LORDJASONJEROME 29
#Valencia 19
#LesAnges6 16
#VoteLuan 15
#hadoop2 12
#Gameinsight 11

The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/TopTenHashTag.java.


Sentiment of hashtags

The process of identifying subjective information in a data source is commonly referred to as sentiment analysis. In the previous example, we showed how to detect trending topics in a social network; we'll now analyze the text shared around those topics to determine whether they express a mostly positive or negative sentiment.

A list of positive and negative words for the English language, a so-called opinion lexicon, can be found at http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar.

Note
These resources, and many more, have been collected by Prof. Bing Liu's group at the University of Illinois at Chicago and have been used, among others, in Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web." Proceedings of the 14th International World Wide Web conference (WWW-2005), May 10-14, 2005, Chiba, Japan.

In this example, we'll present a bag-of-words method that, although simplistic in nature, can be used as a baseline to mine opinion in text. For each tweet and each hashtag, we will count the number of times a positive or a negative word appears and normalize this count by the text length.

Note
The bag-of-words model is an approach used in Natural Language Processing and Information Retrieval to represent textual documents. In this model, text is represented as the set or bag (with multiplicity) of its words, disregarding grammar and morphological properties and even word order.

Uncompress the archive and place the word lists into HDFS with the following command lines:

$ hdfs dfs -put positive-words.txt <destination>
$ hdfs dfs -put negative-words.txt <destination>

In the Mapper class, we define two objects that will hold the word lists, positiveWords and negativeWords, as Set<String>:

private Set<String> positiveWords = null;
private Set<String> negativeWords = null;

We override the default setup() method of the Mapper so that a list of positive and negative words, specified by two configuration properties (job.positivewords.path and job.negativewords.path), is read from HDFS using the filesystem API we discussed in the previous chapter. We could have also used the DistributedCache to share this data across the cluster. The helper method parseWordsList reads a word list, strips out comments, and loads the words into a HashSet<String>:

private HashSet<String> parseWordsList(FileSystem fs, Path wordsListPath)
{
    HashSet<String> words = new HashSet<String>();
    try {
        if (fs.exists(wordsListPath)) {
            FSDataInputStream fi = fs.open(wordsListPath);
            BufferedReader br =
                new BufferedReader(new InputStreamReader(fi));
            String line = null;
            while ((line = br.readLine()) != null) {
                if (line.length() > 0 && !line.startsWith(BEGIN_COMMENT)) {
                    words.add(line);
                }
            }
            fi.close();
        }
    }
    catch (IOException e) {
        e.printStackTrace();
    }
    return words;
}

In the Mapper step, we emit, for each hashtag in the tweet, the overall sentiment of the tweet (simply the positive word count minus the negative word count) and the length of the tweet.

We'll use these in the reducer to calculate an overall sentiment ratio weighted by the length of the tweets to estimate the sentiment expressed by a tweet on a hashtag, as follows:

public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
    String[] words = value.toString().split(" ");
    Integer positiveCount = new Integer(0);
    Integer negativeCount = new Integer(0);
    Integer wordsCount = new Integer(0);

    for (String str : words)
    {
        if (str.matches(HASHTAG_PATTERN)) {
            hashtags.add(str);
        }
        if (positiveWords.contains(str)) {
            positiveCount += 1;
        } else if (negativeWords.contains(str)) {
            negativeCount += 1;
        }
        wordsCount += 1;
    }

    Integer sentimentDifference = 0;
    if (wordsCount > 0) {
        sentimentDifference = positiveCount - negativeCount;
    }

    String stats;
    for (String hashtag : hashtags) {
        word.set(hashtag);
        stats = String.format("%d %d", sentimentDifference,
            wordsCount);
        context.write(word, new Text(stats));
    }
}
}

In the Reducer step, we add together the sentiment scores given to each instance of the hashtag and divide by the total size of all the tweets in which it occurred:

public static class HashTagSentimentReducer
    extends Reducer<Text, Text, Text, DoubleWritable> {
    public void reduce(Text key, Iterable<Text> values,
        Context context
        ) throws IOException, InterruptedException {
        double totalDifference = 0;
        double totalWords = 0;
        for (Text val : values) {
            String[] parts = val.toString().split(" ");
            totalDifference += Double.parseDouble(parts[0]);
            totalWords += Double.parseDouble(parts[1]);
        }
        context.write(key,
            new DoubleWritable(totalDifference / totalWords));
    }
}

The full source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/HashTagSentiment.java.

After building the preceding code, execute HashTagSentiment with the following command:

$ hadoop jar build/libs/mapreduce-example.jar \
    com.learninghadoop2.mapreduce.HashTagSentiment twitter.txt output-sentiment \
    <positive words> <negative words>

You can examine the output with the following command:

$ hdfs dfs -cat output-sentiment/part-r-00000

You should see an output similar to the following:

#1068 0.011861271213042056
#10YearsOfLove 0.012285135487494233
#11 0.011941109121333999
#12 0.011938693593171155
#12F 0.012339242266249566
#12M 0.011864286953783268
#12MCalleEnPazYaTeVasNicolas

In the preceding output, each line is composed of a hashtag and the sentiment polarity associated with it. This number is a heuristic that tells us whether a hashtag is associated mostly with positive (polarity > 0) or negative (polarity < 0) sentiment, and the magnitude of such a sentiment: the higher or lower the number, the stronger the sentiment.


Text cleanup using chain mapper

In the examples presented until now, we ignored a key step of essentially every application built around text processing, which is the normalization and cleanup of the input data. Three common components of this normalization step are:

Changing the letter case to either lower or upper
Removal of stopwords
Stemming

In this section, we will show how the ChainMapper class, found at org.apache.hadoop.mapreduce.lib.chain.ChainMapper, allows us to sequentially combine a series of Mappers as the first step of a data cleanup pipeline. Mappers are added to the configured job using the following:

ChainMapper.addMapper(
    Job job,
    Class<? extends Mapper<K1, V1, K2, V2>> klass,
    Class<? extends K1> inputKeyClass,
    Class<? extends V1> inputValueClass,
    Class<? extends K2> outputKeyClass,
    Class<? extends V2> outputValueClass,
    Configuration mapperConf)

The static method addMapper requires the following arguments to be passed:

job: the Job to add the Mapper class to
klass: the Mapper class to add
inputKeyClass: the mapper input key class
inputValueClass: the mapper input value class
outputKeyClass: the mapper output key class
outputValueClass: the mapper output value class
mapperConf: a Configuration with the configuration for the Mapper class

In this example, we will take care of the first item listed above: before computing the sentiment of each tweet, we will convert each word present in its text to lowercase. This will allow us to more accurately ascertain the sentiment of hashtags by ignoring differences in capitalization across tweets.

First of all, we define a new Mapper, LowerCaseMapper, whose map() function calls Java String's toLowerCase() method on its input value and emits the lowercased text:

public class LowerCaseMapper extends Mapper<LongWritable, Text,
        IntWritable, Text> {

    private Text lowercased = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        lowercased.set(value.toString().toLowerCase());
        context.write(new IntWritable(1), lowercased);
    }
}

In the HashTagSentimentChain driver, we configure the Job object so that both Mappers will be chained together and executed:

public class HashTagSentimentChain
        extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();

        // location (on hdfs) of the positive words list
        conf.set("job.positivewords.path", args[2]);
        conf.set("job.negativewords.path", args[3]);

        Job job = Job.getInstance(conf);
        job.setJarByClass(HashTagSentimentChain.class);

        Configuration lowerCaseMapperConf = new Configuration(false);
        ChainMapper.addMapper(job,
                LowerCaseMapper.class,
                LongWritable.class, Text.class,
                IntWritable.class, Text.class,
                lowerCaseMapperConf);

        Configuration hashTagSentimentConf = new Configuration(false);
        ChainMapper.addMapper(job,
                HashTagSentiment.HashTagSentimentMapper.class,
                IntWritable.class,
                Text.class, Text.class,
                Text.class,
                hashTagSentimentConf);

        job.setReducerClass(HashTagSentiment.HashTagSentimentReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(
                new HashTagSentimentChain(), args);
        System.exit(exitCode);
    }
}

The LowerCaseMapper and HashTagSentimentMapper classes are invoked in a pipeline, where the output of the first becomes the input of the second. The output of the last Mapper will be written to the task's output. An immediate benefit of this design is a reduction of disk I/O operations. Mappers do not need to be aware that they are chained.

It's therefore possible to reuse specialized Mappers that can be combined within a single task. Note that this pattern assumes that all Mappers, and the Reduce, use matching output and input (key, value) pairs. No casting or conversion is done by ChainMapper itself.

Finally, notice that the addMapper call for the last mapper in the chain specifies the output key/value classes applicable to the whole mapper pipeline when used as a composite.

The full source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch3/src/main/java/com/learninghadoop2/mapreduce/HashTagSentimentChain.java

Execute HashTagSentimentChain with the following command:

$ hadoop jar build/libs/mapreduce-example.jar com.learninghadoop2.mapreduce.HashTagSentimentChain twitter.txt output <positive words> <negative words>

You should see an output similar to the previous example. Notice that this time, the hashtag in each line is lowercased.

Walking through a run of a MapReduce job

To explore the relationship between mapper and reducer in more detail, and to expose some of Hadoop's inner workings, we'll now go through how a MapReduce job is executed. This applies to MapReduce in both Hadoop 1 and Hadoop 2, even though the latter is implemented very differently using YARN, which we'll discuss later in this chapter. Additional information on the services described in this section, as well as suggestions for troubleshooting MapReduce applications, can be found in Chapter 10, Running a Hadoop Cluster.

Startup

The driver is the only piece of code that runs on our local machine, and the call to Job.waitForCompletion() starts the communication with the JobTracker, which is the master node in the MapReduce system. The JobTracker is responsible for all aspects of job scheduling and execution, so it becomes our primary interface when performing any task related to job management.

To share resources on the cluster, the JobTracker can use one of several scheduling approaches to handle incoming jobs. The general model is to have a number of queues to which jobs can be submitted, along with policies to assign resources across the queues. The most commonly used implementations of these policies are the Capacity and Fair Schedulers.

The JobTracker communicates with the NameNode on our behalf and manages all interactions relating to the data stored on HDFS.

Splitting the input

The first of these interactions happens when the JobTracker looks at the input data and determines how to assign it to map tasks. Recall that HDFS files are usually split into blocks of at least 64 MB and the JobTracker will assign each block to one map task. Our WordCount example, of course, used a trivial amount of data that was well within a single block. Picture a much larger input file measured in terabytes, and the split model makes more sense. Each segment of the file (or split, in MapReduce terminology) is processed uniquely by one map task. Once it has computed the splits, the JobTracker places them and the JAR file containing the Mapper and Reducer classes into a job-specific directory on HDFS, whose path will be passed to each task as it starts.

Task assignment

The TaskTracker service is responsible for allocating resources, executing, and tracking the status of map and reduce tasks running on a node. Once the JobTracker has determined how many map tasks will be needed, it looks at the number of hosts in the cluster, how many TaskTrackers are working, and how many map tasks each can concurrently execute (a user-definable configuration variable). The JobTracker also looks to see where the various input data blocks are located across the cluster and attempts to define an execution plan that maximizes the cases when a TaskTracker processes a split/block located on the same physical host, or, failing that, processes at least one in the same hardware rack. This data locality optimization is a huge reason behind Hadoop's ability to efficiently process such large datasets. Recall also that, by default, each block is replicated across three different hosts, so the likelihood of producing a task/host plan that sees most blocks processed locally is higher than it might seem at first.

Task startup

Each TaskTracker then starts up a separate Java virtual machine to execute the tasks. This does add a startup time penalty, but it isolates the TaskTracker from problems caused by misbehaving map or reduce tasks, and it can be configured to be shared between subsequently executed tasks.

If the cluster has enough capacity to execute all the map tasks at once, they will all be started and given a reference to the split they are to process and the job JAR file. If there are more tasks than the cluster capacity, the JobTracker will keep a queue of pending tasks and assign them to nodes as they complete their initially assigned map tasks.

We are now ready to see how the map tasks process their data. If all this sounds like a lot of work, it is; it explains why, when running any MapReduce job, there is always a non-trivial amount of time taken as the system gets started and performs all these steps.

Ongoing JobTracker monitoring

The JobTracker doesn't just stop work now and wait for the TaskTrackers to execute all the mappers and reducers. It's constantly exchanging heartbeat and status messages with the TaskTrackers, looking for evidence of progress or problems. It also collects metrics from the tasks throughout the job execution, some provided by Hadoop and others specified by the developer of the map and reduce tasks, although we don't use any in this example.

Mapper input

The driver class specifies the format and structure of the input file using TextInputFormat, and from this, Hadoop knows to treat this as text with the byte offset as the key and line contents as the value. Assume that our dataset contains the following text:

This is a test
Yes it is

The two invocations of the mapper will therefore be given the following input:

1 This is a test
2 Yes it is

Mapper execution

The key/value pairs received by the mapper are the offset in the file of the line and the line contents, respectively, because of how the job is configured. Our implementation of the map method in WordCountMapper discards the key, as we do not care where each line occurred in the file, and splits the provided value into words using the split method on the standard Java String class. Note that better tokenization could be provided by use of regular expressions or the StringTokenizer class, but for our purposes this simple approach will suffice. For each individual word, the mapper then emits a key comprised of the actual word itself, and a value of 1.

Mapper output and reducer input

The output of the mapper is a series of pairs of the form (word, 1); in our example, these will be:

(This, 1), (is, 1), (a, 1), (test, 1), (Yes, 1), (it, 1), (is, 1)

These output pairs from the mapper are not passed directly to the reducer. Between mapping and reducing is the shuffle stage, where much of the magic of MapReduce occurs.

Reducer input

The reducer TaskTracker receives updates from the JobTracker that tell it which nodes in the cluster hold map output partitions that need to be processed by its local reduce task. It then retrieves these from the various nodes and merges them into a single file that will be fed to the reduce task.

Reducer execution

Our WordCountReducer class is very simple; for each word, it simply counts the number of elements in the array and emits the final (word, count) output for each word. For our invocation of WordCount on our sample input, all but one word has only one value in the list of values; is has two.

Reducer output

The final set of reducer output for our example is therefore:

(This, 1), (is, 2), (a, 1), (test, 1), (Yes, 1), (it, 1)

This data will be output to partition files within the output directory specified in the driver, formatted using the specified OutputFormat implementation. Each reduce task writes to a single file with the filename part-r-nnnnn, where nnnnn starts at 00000 and is incremented.

Shutdown

Once all tasks have completed successfully, the JobTracker outputs the final state of the job to the client, along with the final aggregates of some of the more important counters that it has been aggregating along the way. The full job and task history is available in the log directory on each node or, more accessibly, via the JobTracker web UI; point your browser to port 50030 on the JobTracker node.

Input/Output

We have talked about files being broken into splits as part of the job startup and the data in a split being sent to the mapper implementation. However, this overlooks two aspects: how the data is stored in the file and how the individual keys and values are passed to the mapper structure.

InputFormat and RecordReader

Hadoop has the concept of InputFormat for the first of these responsibilities. The InputFormat abstract class in the org.apache.hadoop.mapreduce package provides two methods, as shown in the following code:

public abstract class InputFormat<K, V>
{
    public abstract List<InputSplit> getSplits(JobContext context);

    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
            TaskAttemptContext context);
}

These methods display the two responsibilities of the InputFormat class:

To provide details on how to divide an input file into the splits required for map processing
To create a RecordReader that will generate the series of key/value pairs from a split

TheRecordReaderclassisalsoanabstractclasswithintheorg.apache.hadoop.mapreducepackage:

publicabstractclassRecordReader<Key,Value>implementsCloseable

{

publicabstractvoidinitialize(InputSplitsplit,

TaskAttemptContextcontext);

publicabstractbooleannextKeyValue()

throwsIOException,InterruptedException;

publicabstractKeygetCurrentKey()

throwsIOException,InterruptedException;

publicabstractValuegetCurrentValue()

throwsIOException,InterruptedException;

publicabstractfloatgetProgress()

throwsIOException,InterruptedException;

publicabstractclose()throwsIOException;

}

A RecordReader instance is created for each split; nextKeyValue() is called to return a Boolean indicating whether another key/value pair is available, and, if so, the getCurrentKey() and getCurrentValue() methods are used to access the key and value respectively.

The combination of the InputFormat and RecordReader classes is therefore all that is required to bridge between any kind of input data and the key/value pairs required by MapReduce.
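To make this bridging role concrete, the following is a minimal, hypothetical sketch of a custom InputFormat; the class name and behaviour (presenting each line of a text file upper-cased) are purely illustrative, and it simply delegates the real file handling to Hadoop's LineRecordReader:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Illustrative InputFormat: same splits as any file-based input, but each value is upper-cased.
public class UpperCaseTextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new RecordReader<LongWritable, Text>() {
            private final LineRecordReader reader = new LineRecordReader(); // does the actual reading
            private final Text upperCased = new Text();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException, InterruptedException {
                reader.initialize(split, context);
            }

            @Override
            public boolean nextKeyValue() throws IOException, InterruptedException {
                return reader.nextKeyValue();
            }

            @Override
            public LongWritable getCurrentKey() throws IOException, InterruptedException {
                return reader.getCurrentKey(); // byte offset, as with TextInputFormat
            }

            @Override
            public Text getCurrentValue() throws IOException, InterruptedException {
                upperCased.set(reader.getCurrentValue().toString().toUpperCase());
                return upperCased;
            }

            @Override
            public float getProgress() throws IOException, InterruptedException {
                return reader.getProgress();
            }

            @Override
            public void close() throws IOException {
                reader.close();
            }
        };
    }
}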

Hadoop-provided InputFormat

There are some Hadoop-provided InputFormat implementations within the org.apache.hadoop.mapreduce.lib.input package:

FileInputFormat: is an abstract base class that can be the parent of any file-based input
SequenceFileInputFormat: reads the efficient binary SequenceFile format, which will be discussed in an upcoming section
TextInputFormat: is used for plain text files
KeyValueTextInputFormat: is also used for plain text files; each line is divided into key and value parts by a separator byte

Note that input formats are not restricted to reading from files; FileInputFormat is itself a subclass of InputFormat. It's possible to have Hadoop use data that is not based on files as the input to MapReduce jobs; common sources are relational databases or column-oriented databases, such as Amazon DynamoDB or HBase.
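As an illustration of how an InputFormat is selected in a driver, the following fragment is a sketch that configures a job to use KeyValueTextInputFormat with a comma as the separator byte; the property name shown is the one used by the Hadoop 2 mapreduce API, and the helper class here is hypothetical:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// Illustrative helper: read key/value pairs from plain text files.
public class KeyValueInputExample {
    public static void configureInput(Job job) {
        // Each line is split into a Text key and Text value at the first separator,
        // so mappers must expect Text keys rather than LongWritable byte offsets.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.getConfiguration().set(
                "mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
    }
}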

Hadoop-provided RecordReader

Hadoop provides a few common RecordReader implementations, which are also present within the org.apache.hadoop.mapreduce.lib.input package:

LineRecordReader: is the default RecordReader class for text files; it presents the byte offset in the file as the key and the line contents as the value
SequenceFileRecordReader: reads the key/value from the binary SequenceFile container

OutputFormat and RecordWriter

There is a similar pattern for writing the output of a job, coordinated by subclasses of OutputFormat and RecordWriter from the org.apache.hadoop.mapreduce package. We won't explore these in any detail here, but the general approach is similar, although OutputFormat does have a more involved API, as it has methods for tasks such as validation of the output specification.

It's this step that causes a job to fail if a specified output directory already exists. If you wanted different behavior, it would require a subclass of OutputFormat that overrides this method.

Hadoop-provided OutputFormat

The following output formats are provided in the org.apache.hadoop.mapreduce.lib.output package:

FileOutputFormat: is the base class for all file-based OutputFormats
NullOutputFormat: is a dummy implementation that discards the output and writes nothing to the file
SequenceFileOutputFormat: writes to the binary SequenceFile format
TextOutputFormat: writes a plain text file

Note that these classes define their required RecordWriter implementations as static nested classes, so there are no separately provided RecordWriter implementations.

Sequence files

The SequenceFile class within the org.apache.hadoop.io package provides an efficient binary file format that is often useful as an output from a MapReduce job. This is especially true if the output from the job is processed as the input of another job. Sequence files have several advantages, as follows:

As binary files, they are intrinsically more compact than text files
They additionally support optional compression, which can also be applied at different levels, that is, they can compress each record or an entire split
They can be split and processed in parallel

This last characteristic is important, as most binary formats, particularly those that are compressed or encrypted, cannot be split and must be read as a single linear stream of data. Using such files as input to a MapReduce job means that a single mapper will be used to process the entire file, causing a potentially large performance hit. In such a situation, it's preferable to use a splittable format, such as SequenceFile, or, if you cannot avoid receiving the file in another format, do a preprocessing step that converts it into a splittable format. This will be a tradeoff, as the conversion will take time, but in many cases, especially with complex map tasks, this will be outweighed by the time saved through increased parallelism.
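The following driver fragment is a minimal sketch of configuring a job to write block-compressed SequenceFile output; the output path is illustrative, and the Snappy codec assumes the native Snappy libraries are available on the cluster:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Illustrative helper: write a job's output as compressed SequenceFiles.
public class SequenceFileOutputExample {
    public static void configureOutput(Job job) {
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("output-seq")); // illustrative path
        FileOutputFormat.setCompressOutput(job, true);
        // BLOCK compression compresses batches of records rather than each record individually
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class); // requires native Snappy libraries
    }
}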

YARN

YARN started out as part of the MapReduce v2 (MRv2) initiative but is now an independent sub-project within Hadoop (that is, it's at the same level as MapReduce). It grew out of a realization that MapReduce in Hadoop 1 conflated two related but distinct responsibilities: resource management and application execution.

Although it has enabled previously unimagined processing on enormous datasets, the MapReduce model at a conceptual level has an impact on performance and scalability. Implicit in the MapReduce model is that any application can only be composed of a series of largely linear MapReduce jobs, each of which follows a model of one or more maps followed by one or more reduces. This model is a great fit for some applications, but not all. In particular, it's a poor fit for workloads requiring very low-latency response times; the MapReduce startup times and sometimes lengthy job chains often greatly exceed the tolerance for a user-facing process. The model has also been found to be very inefficient for jobs that would more naturally be represented as a directed acyclic graph (DAG) of tasks, where the nodes on the graph are processing steps and the edges are data flows. If analyzed and executed as a DAG, the application may be performed in one step with high parallelism across the processing steps, but when viewed through the MapReduce lens, the result is usually an inefficient series of interdependent MapReduce jobs.

Numerous projects have built different types of processing atop MapReduce and, although many are wildly successful (Apache Hive and Pig are two standout examples), the close coupling of MapReduce as a processing paradigm with the job scheduling mechanism in Hadoop 1 made it very difficult for any new project to tailor either of these areas to its specific needs.

The result is Yet Another Resource Negotiator (YARN), which provides a highly capable job scheduling mechanism within Hadoop and well-defined interfaces for different processing models to be implemented within it.

YARN architecture

To understand how YARN works, it's important to stop thinking about MapReduce and how it processes data. YARN itself says nothing about the nature of the applications that run atop it; rather, it's focused on providing the machinery for the scheduling and execution of these jobs. As we'll see, YARN is just as capable of hosting long-running stream processing or low-latency, user-facing workloads as it is of hosting batch-processing workloads, such as MapReduce.

The components of YARN

YARN is comprised of two main components: the ResourceManager (RM), which manages resources across the cluster, and the NodeManager (NM), which runs on each host and manages the resources on the individual machine. The ResourceManager and NodeManagers deal with the scheduling and management of containers, an abstract notion of the memory, CPU, and I/O that will be dedicated to running a particular piece of application code. Using MapReduce as an example, when running atop YARN, the JobTracker and each TaskTracker all run in their own dedicated containers. Note, though, that in YARN, each MapReduce job has its own dedicated JobTracker; there is no single instance that manages all jobs, as in Hadoop 1.

YARN itself is responsible only for the scheduling of tasks across the cluster; all notions of application-level progress, monitoring, and fault tolerance are handled in the application code. This is a very explicit design decision; by making YARN as independent as possible, it has a very clear set of responsibilities and does not artificially constrain the types of application that can be implemented on YARN.

As the arbiter of all cluster resources, YARN has the ability to efficiently manage the cluster as a whole and not focus on application-level resource requirements. It has a pluggable scheduling policy, with the provided implementations similar to the existing Hadoop Capacity and Fair Schedulers. YARN also treats all application code as inherently untrusted, and all application management and control tasks are performed in user space.

Anatomy of a YARN application

A submitted YARN application has two components: the ApplicationMaster (AM), which coordinates the overall application flow, and the specification of the code that will run on the worker nodes. For MapReduce atop YARN, the JobTracker implements the ApplicationMaster functionality and the TaskTrackers are the application custom code deployed on the worker nodes.

As mentioned in the previous section, the responsibilities of application management, progress monitoring, and fault tolerance are pushed to the application level in YARN. It's the ApplicationMaster that performs these tasks; YARN itself says nothing about the mechanisms for communication between the ApplicationMaster and the code running in the worker containers, for example.

This genericity allows YARN applications not to be tied to Java classes. The ApplicationMaster can instead request a NodeManager to execute shell scripts, native applications, or any other type of processing that is made available on each node.

Lifecycle of a YARN application

As with MapReduce jobs in Hadoop 1, YARN applications are submitted to the cluster by a client. When a YARN application is started, the client first calls the ResourceManager (more specifically, the ApplicationManager portion of the ResourceManager) and requests the initial container within which to execute the ApplicationMaster. In most cases, the ApplicationMaster will run from a hosted container in the cluster, just as will the rest of the application code. The ApplicationManager communicates with the other main component of the ResourceManager, the scheduler itself, which has the ultimate responsibility of managing all resources across the cluster.

The ApplicationMaster starts up in the provided container, registers itself with the ResourceManager, and begins the process of negotiating its required resources. The ApplicationMaster communicates with the ResourceManager and requests the containers it requires. The specification of the containers requested can also include additional information, such as desired placement within the cluster and concrete resource requirements, such as a particular amount of memory or CPU.

The ResourceManager provides the ApplicationMaster with the details of the containers it has been allocated, and the ApplicationMaster then communicates with the NodeManagers to start the application-specific task for each container. This is done by providing the NodeManager with the specification of the application to be executed, which, as mentioned, may be a JAR file, a script, a path to a local executable, or anything else that the NodeManager can invoke. Each NodeManager instantiates the container for the application code and starts the application based on the provided specification.

Fault tolerance and monitoring

From this point onward, the behavior is largely application specific. YARN will not manage application progress but does perform a number of ongoing tasks. The AMLivelinessMonitor within the ResourceManager receives heartbeats from all ApplicationMasters, and if it determines that an ApplicationMaster has failed or stopped working, it will de-register the failed ApplicationMaster and release all its allocated containers. The ResourceManager will then reschedule the application a configurable number of times.

Alongside this process, the NMLivelinessMonitor within the ResourceManager receives heartbeats from the NodeManagers and keeps track of the health of each NodeManager in the cluster. Similar to the monitoring of ApplicationMaster health, a NodeManager will be marked as dead after receiving no heartbeats for a default time of 10 minutes, after which all its allocated containers are marked as dead, and the node is excluded from future resource allocation.

At the same time, the NodeManager will actively monitor resource utilization of each allocated container and, for those resources not constrained by hard limits, will kill containers that exceed their resource allocation.

At a higher level, the YARN scheduler will always be looking to maximize the cluster utilization within the constraints of the sharing policy being employed. As with Hadoop 1, this will allow low-priority applications to use more cluster resources if contention is low, but the scheduler will then preempt these additional containers (that is, request them to be terminated) if higher-priority applications are submitted.

The rest of the responsibility for application-level fault tolerance and progress monitoring must be implemented within the application code. For MapReduce on YARN, for example, all the management of task scheduling and retries is provided at the application level and is not in any way delivered by YARN.

Thinking in layers

These last statements may suggest that writing applications to run on YARN is a lot of work, and this is true. The YARN API is quite low-level and likely intimidating for most developers who just want to run some processing tasks on their data. If all we had was YARN, and every new Hadoop application had to have its own ApplicationMaster implemented, then YARN would not look quite as interesting as it does.

What makes the picture better is that, in general, the requirement isn't to implement each and every application on YARN, but instead to use it for a smaller number of processing frameworks that provide much friendlier interfaces to be implemented. The first of these was MapReduce; with it hosted on YARN, the developer writes to the usual map and reduce interfaces and is largely unaware of the YARN mechanics.

But on the same cluster, another developer may be running a job that uses a different framework with significantly different processing characteristics, and YARN will manage both at the same time.

We'll give some more detail on several YARN processing models currently available, but they run the gamut from batch processing through low-latency queries to stream and graph processing and beyond.

As the YARN experience grows, however, there are a number of initiatives to make the development of these processing frameworks easier. On the one hand, there are higher-level interfaces, such as Cloudera Kitten (https://github.com/cloudera/kitten) or Apache Twill (http://twill.incubator.apache.org/), that give friendlier abstractions above the YARN APIs. Perhaps a more significant development model, though, is the emergence of frameworks that provide richer tools to more easily construct applications with a common general class of performance characteristics.

Execution models

We have mentioned different YARN applications having distinct processing characteristics, but an emerging pattern has seen their execution models in general being a source of differentiation. By this, we refer to how the YARN application lifecycle is managed, and we identify three main types: per-job, per-session, and always-on.

Batch processing, such as MapReduce on YARN, sees the lifecycle of the MapReduce framework tied to that of the submitted application. If we submit a MapReduce job, then the JobTracker and TaskTrackers that execute it are created specifically for the job and are terminated when the job completes. This works well for batch, but if we wish to provide a more interactive model, then the startup overhead of establishing the YARN application and all its resource allocations will severely impact the user experience if every command issued suffers this penalty. A more interactive, or session-based, lifecycle will see the YARN application start and then be available to service a number of submitted requests/commands. The YARN application terminates only when the session is exited.

Finally, we have the concept of long-running applications that process continuous data streams independent of any interactive input. For these, it makes most sense for the YARN application to start and continuously process data that is retrieved through some external mechanism. The application will only exit when explicitly shut down or if an abnormal situation occurs.

YARN in the real world – Computation beyond MapReduce

The previous discussions have been a little abstract, so in this section, we will explore a few existing YARN applications to see just how they use the framework and how they provide a breadth of processing capability. Of particular interest is how the YARN frameworks take very different approaches to resource management, I/O pipelining, and fault tolerance.

The problem with MapReduce

Until now, we have looked at MapReduce in terms of the API. MapReduce in Hadoop is more than that; up until Hadoop 2, it was the default execution engine for a number of tools, among which were Hive and Pig, which we will discuss in more detail later in this book. We have seen how MapReduce applications are, in fact, chains of jobs. This very aspect is one of the biggest pain points and constraining factors of the framework. MapReduce checkpoints data to HDFS for intra-process communication:

A chain of MapReduce jobs

At the end of each reduce phase, output is written to disk so that it can then be loaded by the mappers of the next job and used as its input. This I/O overhead introduces latency, especially when we have applications that require multiple passes on a dataset (hence multiple writes). Unfortunately, this type of iterative computation is at the core of many analytics applications.

Apache Tez and Apache Spark are two frameworks that address this problem by generalizing the MapReduce paradigm. We will briefly discuss them in the remainder of this section, next to Apache Samza, a framework that takes an entirely different approach to real-time processing.

Tez

Tez (http://tez.apache.org) is a low-level API and execution engine focused on providing low-latency processing, and is being used as the basis of the latest evolution of Hive, Pig, and several other frameworks that implement standard join, filter, merge, and group operations. Tez is an implementation and evolution of a programming model presented by Microsoft in the 2009 Dryad paper (http://research.microsoft.com/en-us/projects/dryad/). Tez is a generalization of MapReduce as dataflow that strives to achieve fast, interactive computing by pipelining I/O operations over a queue for intra-process communication. This avoids the expensive writes to disk that affect MapReduce. The API provides primitives expressing dependencies between jobs as a DAG. The full DAG is then submitted to a planner that can optimize the execution flow. The same application depicted in the preceding diagram would be executed in Tez as a single job, with I/O pipelined from reducers to reducers without HDFS writes and subsequent reads by mappers. An example can be seen in the following diagram:

A Tez DAG is a generalization of MapReduce

The canonical WordCount example can be found at https://github.com/apache/incubator-tez/blob/master/tez-mapreduce-examples/src/main/java/org/apache/tez/mapreduce/examples/WordCount.java

DAG dag = new DAG("WordCount");

dag.addVertex(tokenizerVertex)
   .addVertex(summerVertex)
   .addEdge(new Edge(tokenizerVertex, summerVertex,
       edgeConf.createDefaultEdgeProperty()));

Even though the graph topology dag can be expressed with a few lines of code, the boilerplate required to execute the job is considerable. This code handles many of the low-level scheduling and execution responsibilities, including fault tolerance. When Tez detects a failed task, it walks back up the processing graph to find the point from which to re-execute the failed tasks.

Hive-on-Tez

Hive 0.13 is the first high-profile project to use Tez as its execution engine. We'll discuss Hive in a lot more detail in Chapter 7, Hadoop and SQL, but for now we will just touch on how it's implemented on YARN.

Hive (http://hive.apache.org) is an engine for querying data stored on HDFS through standard SQL syntax. It has been enormously successful, as this type of capability greatly reduces the barriers to starting analytic exploration of data in Hadoop.

In Hadoop 1, Hive had no choice but to implement its SQL statements as a series of MapReduce jobs. When SQL is submitted to Hive, it generates the required MapReduce jobs behind the scenes and executes these on the cluster. This approach has two main drawbacks: there is a non-trivial startup penalty each time, and the constrained MapReduce model means that seemingly simple SQL statements are often translated into a lengthy series of multiple dependent MapReduce jobs. This is an example of the type of processing more naturally conceptualized as a DAG of tasks, as described earlier in this chapter.

Although some benefits are achieved when Hive executes its MapReduce jobs within YARN, the major benefits come in Hive 0.13, when the project is fully re-implemented using Tez. By exploiting the Tez APIs, which are focused on providing low-latency processing, Hive gains even more performance while making its codebase simpler.

Since Tez treats its workloads as DAGs, which provide a much better fit for translated SQL queries, Hive on Tez can perform any SQL statement as a single job with maximized parallelism.

Tez helps Hive support interactive queries by providing an always-running service instead of requiring the application to be instantiated from scratch for each SQL submission. This is important because, even though queries that process huge data volumes will simply take some time, the goal is for Hive to become less of a batch tool and instead move to being as much of an interactive tool as possible.

Apache Spark

Spark (http://spark.apache.org) is a processing framework that excels at iterative and near real-time processing. Created at UC Berkeley, it has been donated as an Apache project. Spark provides an abstraction that allows data in Hadoop to be viewed as a distributed data structure upon which a series of operations can be performed. The framework is based on the same concepts Tez draws inspiration from (Dryad), but excels with jobs that allow data to be held and processed in memory, and it can very efficiently schedule processing on the in-memory dataset across the cluster. Spark automatically controls replication of data across the cluster, ensuring that each element of the distributed dataset is held in memory on at least two machines, and provides replication-based fault tolerance somewhat akin to HDFS.

Spark started as a standalone system, but was ported to also run on YARN as of its 0.8 release. Spark is particularly interesting because, although its classic processing model is batch-oriented, with the Spark shell it provides an interactive frontend, and with the Spark Streaming sub-project it also offers near real-time processing of data streams. Spark is different things to different people; it's both a high-level API and an execution engine. At the time of writing, ports of Hive and Pig to Spark are in progress.

Apache Samza

Samza (http://samza.apache.org) is a stream-processing framework developed at LinkedIn and donated to the Apache Software Foundation. Samza processes conceptually infinite streams of data, which are seen by the application as a series of messages.

Samza currently integrates most tightly with Apache Kafka (http://kafka.apache.org), although it does have a pluggable architecture. Kafka itself is a messaging system that excels at large data volumes and provides a topic-based abstraction similar to most other messaging platforms, such as RabbitMQ. Publishers send messages to topics and interested clients consume messages from the topics as they arrive. Kafka has multiple aspects that set it apart from other messaging platforms, but for this discussion, the most interesting one is that Kafka stores messages for a period of time, which allows messages in topics to be replayed. Topics are partitioned across multiple hosts and partitions can be replicated across hosts to protect from node failure.

Samza builds its processing flow on its concept of streams, which, when using Kafka, map directly to Kafka partitions. A typical Samza job may listen to one topic for incoming messages, perform some transformations, and then write the output to a different topic. Multiple Samza jobs can then be composed to provide more complex processing structures.

As a YARN application, the Samza ApplicationMaster monitors the health of all running Samza tasks. If a task fails, then a replacement task is instantiated in a new container. Samza achieves fault tolerance by having each task write its progress to a new stream (again modeled as a Kafka topic), so any replacement task just needs to read the latest task state from this checkpoint topic and then replay the main message topic from the last processed position. Samza additionally offers support for local task state, which can be very useful for join and aggregation type workloads. This local state is again built atop the stream abstraction and hence is intrinsically resilient to host failure.

YARN-independent frameworks

An interesting point to note is that two of the preceding projects (Samza and Spark) run atop YARN but are not specific to YARN. Spark started out as a standalone service and has implementations for other schedulers, such as Apache Mesos, or to run on Amazon EC2. Though Samza runs only on YARN today, its architecture is explicitly not YARN-specific, and there are discussions about providing realizations on other platforms.

If the YARN model of pushing as much as possible into the application has its downsides through implementation complexity, then this decoupling is one of its major benefits. An application written to use YARN need not be tied to it; by definition, all the functionality for the actual application logic and management is encapsulated within the application code and is independent of YARN or another framework. This is, of course, not saying that designing a scheduler-independent application is a trivial task, but it's now a tractable task; this was absolutely not the case pre-YARN.

YARN today and beyond

Though YARN has been used in production (at Yahoo! in particular) for some time, the final GA version was not released until late 2012. The interfaces to YARN were also somewhat fluid until quite late in the development cycle. Consequently, the fully forward-compatible YARN as of Hadoop 2.2 is still relatively new.

YARN is fully functional today, and the future direction will see extensions to its current capabilities. Perhaps most notable among these will be the ability to specify and control container resources on more dimensions. Currently, only location, memory, and CPU specifications are possible, and this will be expanded into areas such as storage and network I/O.

In addition, the ApplicationMaster currently has little control over the management of how containers are co-located or not. Finer-grained control here will allow the ApplicationMaster to specify policies around when containers may or may not be scheduled on the same node. In addition, the current resource allocation model is quite static, and it will be useful to allow an application to dynamically change the resources allocated to a running container.

Summary

This chapter explored how to process those large volumes of data that we discussed so much in the previous chapter. In particular, we covered:

How MapReduce was the only processing model available in Hadoop 1, and its conceptual model
The Java API to MapReduce, and how to use this to build some examples, from a word count to sentiment analysis of Twitter hashtags
The details of how MapReduce is implemented in practice, and we walked through the execution of a MapReduce job
How Hadoop stores data and the classes involved to represent input and output formats and record readers and writers
The limitations of MapReduce that led to the development of YARN, opening the door to multiple computational models on the Hadoop platform
The YARN architecture and how applications are built atop it

In the next two chapters, we will move away from strictly batch processing and delve into the world of near real-time and iterative processing, using two of the YARN-hosted frameworks we introduced in this chapter, namely Samza and Spark.

Chapter 4. Real-time Computation with Samza

The previous chapter discussed YARN, and frequently mentioned the breadth of computational models and processing frameworks outside of traditional batch-based MapReduce that it enables on the Hadoop platform. In this chapter and the next, we will explore two such projects in some depth, namely Apache Samza and Apache Spark. We chose these frameworks as they demonstrate the usage of stream and iterative processing and also provide interesting mechanisms to combine processing paradigms. In this chapter, we will explore Samza and cover the following topics:

What Samza is and how it integrates with YARN and other projects such as Apache Kafka
How Samza provides a simple callback-based interface for stream processing
How Samza composes multiple stream processing jobs into more complex workflows
How Samza supports persistent local state within tasks and how this greatly enriches what it can enable

Stream processing with Samza

To explore a pure stream-processing platform, we will use Samza, which is available at https://samza.apache.org. The code shown here was tested with the current 0.8 release and we'll keep the GitHub repository updated as the project continues to evolve.

Samza was built at LinkedIn and donated to the Apache Software Foundation in September 2013. Over the years, LinkedIn has built a model that conceptualizes much of their data as streams, and from this they saw the need for a framework that can provide a developer-friendly mechanism to process these ubiquitous data streams.

The team at LinkedIn realized that, when it came to data processing, much of the attention went to the extreme ends of the spectrum; for example, RPC workloads are usually implemented as synchronous systems with very low latency requirements, or batch systems where the periodicity of jobs is often measured in hours. The ground in between has been relatively poorly supported, and this is the area that Samza is targeted at; most of its jobs expect response times ranging from milliseconds to minutes. They also assume that data arrives in a theoretically infinite stream of continuous messages.

How Samza works

There are numerous stream-processing systems, such as Storm (http://storm.apache.org), in the open source world, and many other (mostly commercial) tools, such as complex event processing (CEP) systems, that also target processing on continuous message streams. These systems have many similarities but also some major differences.

For Samza, perhaps the most significant difference is its assumptions about message delivery. Many systems work very hard to reduce the latency of each message, sometimes with an assumption that the goal is to get the message into and out of the system as fast as possible. Samza assumes almost the opposite; its streams are persistent and resilient, and any message written to a stream can be re-read for a period of time after its first arrival. As we will see, this gives significant capability around fault tolerance. Samza also builds on this model to allow each of its tasks to hold resilient local state.

Samza is mostly implemented in Scala, even though its public APIs are written in Java. We'll show Java examples in this chapter, but any JVM language can be used to implement Samza applications. We'll discuss Scala when we explore Spark in the next chapter.

Samza high-level architecture

Samza views the world as having three main layers or components: the streaming, execution, and processing layers.

Samza architecture

The streaming layer provides access to the data streams, both for consumption and publication. The execution layer provides the means by which Samza applications can be run, have resources such as CPU and memory allocated, and have their lifecycles managed. The processing layer is the actual Samza framework itself, and its interfaces allow per-message functionality.

Samza provides pluggable interfaces to support the first two layers, though the current main implementations use Kafka for streaming and YARN for execution. We'll discuss these further in the following sections.

Samza's best friend – Apache Kafka

Samza itself does not implement the actual message stream. Instead, it provides an interface for a message system with which it then integrates. The default stream implementation is built upon Apache Kafka (http://kafka.apache.org), a messaging system also built at LinkedIn but now a successful and widely adopted open source project.

Kafka can be viewed as a message broker akin to something like RabbitMQ or ActiveMQ, but, as mentioned earlier, it writes all messages to disk and scales out across multiple hosts as a core part of its design. Kafka uses the concept of a publish/subscribe model through named topics to which producers write messages and from which consumers read messages. These work much like topics in any other messaging system.

Because Kafka writes all messages to disk, it might not have the same ultra-low latency message throughput as other messaging systems, which focus on getting the message processed as fast as possible and don't aim to store the message long term. Kafka can, however, scale exceptionally well and its ability to replay a message stream can be extremely useful. For example, if a consuming client fails, then it can re-read messages from a known good point in time, or if a downstream algorithm changes, then traffic can be replayed to utilize the new functionality.

When scaling across hosts, Kafka partitions topics and supports partition replication for fault tolerance. Each Kafka message has a key associated with the message and this is used to decide to which partition a given message is sent. This allows semantically useful partitioning; for example, if the key is a user ID in the system, then all messages for a given user will be sent to the same partition. Kafka guarantees ordered delivery within each partition so that any client reading a partition can know that they are receiving all messages for each key in that partition in the order in which they are written by the producer.

Samza periodically writes out checkpoints of the position up to which it has read in all the streams it is consuming. These checkpoint messages are themselves written to a Kafka topic. Thus, when a Samza job starts up, each task can reread its checkpoint stream to know from which position in the stream to start processing messages. This means that, in effect, Kafka also acts as a buffer; if a Samza job crashes or is taken down for upgrade, no messages will be lost. Instead, the job will just restart from the last checkpointed position when it restarts. This buffer functionality is also important, as it makes it easier for multiple Samza jobs to run as part of a complex workflow. When Kafka topics are the points of coordination between the jobs, one job might consume a topic being written to by another; in such cases, Kafka can help smooth out issues caused due to any given job running slower than others. Traditionally, the backpressure caused by a slow-running job can be a real issue in a system comprised of multiple job stages, but Kafka as the resilient buffer allows each job to read and write at its own rate. Note that this is analogous to how multiple coordinating MapReduce jobs will use HDFS for similar purposes.

Kafka provides at-least-once message delivery semantics, that is to say that any message written to Kafka will be guaranteed to be available to a client of the particular partition. Messages might be processed between checkpoints, however, so it is possible for duplicate messages to be received by the client. There are application-specific mechanisms to mitigate this, and both Kafka and Samza have exactly-once semantics on their roadmaps, but for now it is something you should take into consideration when designing jobs.

We won't explain Kafka further beyond what we need to demonstrate Samza. If you are interested, check out its website and wiki; there is a lot of good information, including some excellent papers and presentations.

YARN integration

As mentioned earlier, just as Samza utilizes Kafka for its streaming layer implementation, it uses YARN for the execution layer. Just like any YARN application described in Chapter 3, Processing – MapReduce and Beyond, Samza provides an implementation of an ApplicationMaster, which controls the lifecycle of the overall job, plus implementations of the Samza-specific functionality (called tasks) that is executed in each container. Just as Kafka partitions its topics, tasks are the mechanism by which Samza partitions its processing. Each Kafka partition will be read by a single Samza task. If a Samza job consumes multiple streams, then a given task will be the only consumer within the job for every stream partition assigned to it.

The Samza framework is told by each job configuration about the Kafka streams that are of interest to the job, and Samza continuously polls these streams to determine if any new messages have arrived. When a new message is available, the Samza task invokes a user-defined callback to process the message, a model that shouldn't look too alien to MapReduce developers. This method is defined in an interface called StreamTask and has the following signature:

public void process(IncomingMessageEnvelope envelope,
                    MessageCollector collector,
                    TaskCoordinator coordinator)

This is the core of each Samza task and defines the functionality to be applied to received messages. The received message that is to be processed is wrapped in the IncomingMessageEnvelope; output messages can be written to the MessageCollector, and task management (such as shutdown) can be performed via the TaskCoordinator.

As mentioned, Samza creates one task instance for each partition in the underlying Kafka topic. Each YARN container will manage one or more of these tasks. The overall model, then, is of the Samza ApplicationMaster coordinating multiple containers, each of which is responsible for one or more StreamTask instances.

An independent model

Though we will talk exclusively of Kafka and YARN as the providers of Samza's streaming and execution layers in this chapter, it is important to remember that the core Samza system uses well-defined interfaces for both the stream and execution systems. There are implementations of multiple stream sources (we'll see one in the next section), and alongside the YARN support, Samza ships with a LocalJobRunner class. This alternative method of running tasks can execute StreamTask instances in-process on the JVM instead of requiring a full YARN cluster, which can sometimes be a useful testing and debugging tool. There is also discussion of Samza implementations on top of other cluster manager or virtualization frameworks.

Hello Samza!

Since not everyone already has ZooKeeper, Kafka, and YARN clusters ready to be used, the Samza team has created a wonderful way to get started with the product. Instead of just having a Hello world! program, there is a repository called Hello Samza, which is available by cloning the repository at git://git.apache.org/samza-hello-samza.git.

This will download and install dedicated instances of ZooKeeper, Kafka, and YARN (the three major prerequisites for Samza), creating a full stack upon which you can submit Samza jobs.

There are also a number of example Samza jobs that process data from Wikipedia edit notifications. Take a look at the page at http://samza.apache.org/startup/hello-samza/0.8/ and follow the instructions given there. (At the time of writing, Samza is still a relatively young project and we'd rather not include direct information about the examples, which might be subject to change.)

For the remainder of the Samza examples in this chapter, we'll assume you are either using the Hello Samza package to provide the necessary components (ZooKeeper/Kafka/YARN) or you have integrated with other instances of each.

This example has three different Samza jobs that build upon each other. The first reads the Wikipedia edits, the second parses these records, and the third produces statistics based on the processed records. We'll build our own multistream workflow shortly.

One interesting point is the WikipediaFeed example here; it uses Wikipedia as its message source instead of Kafka. Specifically, it provides another implementation of the Samza SystemConsumer interface to allow Samza to read messages from an external system. As mentioned earlier, Samza is not tied to Kafka and, as this example shows, building a new stream implementation does not have to be against a generic infrastructure component; it can be quite job-specific, as the work required is not huge.

Tip

Note that the default configuration for both ZooKeeper and Kafka will write system data to directories under /tmp, which will be what you have set if you use Hello Samza. Be careful if you are using a Linux distribution that purges the contents of this directory on a reboot. If you plan to carry out any significant testing, then it's best to reconfigure these components to use less ephemeral locations. Change the relevant config files for each service; they are located in the service directory under the hello-samza/deploy directory.

Building a tweet parsing job

Let's build our own simple job implementation to show the full code required. We'll use parsing of the Twitter stream as the example in this chapter and will later set up a pipe from our client consuming messages from the Twitter API into a Kafka topic. So, we need a Samza task that will read the stream of JSON messages, extract the actual tweet text, and write these to a topic of tweets.

Here is the main code from TwitterParseStreamTask.java, available at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterParseStreamTask.java

package com.learninghadoop2.samza.tasks;

public class TwitterParseStreamTask implements StreamTask {
    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector
            collector, TaskCoordinator coordinator) {
        String msg = ((String) envelope.getMessage());

        try {
            JSONParser parser = new JSONParser();
            Object obj = parser.parse(msg);
            JSONObject jsonObj = (JSONObject) obj;
            String text = (String) jsonObj.get("text");

            collector.send(new OutgoingMessageEnvelope(new
                    SystemStream("kafka", "tweets-parsed"), text));
        } catch (ParseException pe) {}
    }
}

The code is largely self-explanatory, but there are a few points of interest. We use JSON Simple (http://code.google.com/p/json-simple/) for our relatively straightforward JSON parsing requirements; we'll also use it later in this book.

The IncomingMessageEnvelope and its corresponding OutgoingMessageEnvelope are the main structures concerned with the actual message data. Along with the message payload, the envelope will also have data concerning the system, topic name, and (optionally) partition number, in addition to other metadata. For our purposes, we just extract the message body from the incoming message and send the tweet text we extract from it via a new OutgoingMessageEnvelope to a topic called tweets-parsed within a system called kafka. Note the lowercase name; we'll explain this in a moment.

The type of the message in the IncomingMessageEnvelope is java.lang.Object. Samza does not currently enforce a data model and hence does not have strongly-typed message bodies. Therefore, when extracting the message contents, an explicit cast is usually required. Since each task needs to know the expected message format of the streams it processes, this is not the oddity that it may appear to be.


The configuration file

There was nothing in the previous code that said where the messages came from; the framework just presents them to the StreamTask implementation, but obviously Samza needs to know from where to fetch messages. There is a configuration file for each job that defines this and more. The following can be found as twitter-parser.properties at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/resources/twitter-parser.properties:

# Job
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=twitter-parser

# YARN
yarn.package.path=file:///home/gturkington/samza/build/distributions/learninghadoop2-0.1.tar.gz

# Task
task.class=com.learninghadoop2.samza.tasks.TwitterParseStreamTask
task.inputs=kafka.tweets
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
# Normally, this would be 3, but we have only one broker.
task.checkpoint.replication.factor=1

# Serializers
serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory

# Systems
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.streams.tweets.samza.msg.serde=string
systems.kafka.streams.tweets-parsed.samza.msg.serde=string
systems.kafka.consumer.zookeeper.connect=localhost:2181/
systems.kafka.consumer.auto.offset.reset=largest
systems.kafka.producer.metadata.broker.list=localhost:9092
systems.kafka.producer.producer.type=sync
systems.kafka.producer.batch.num.messages=1

This may look like a lot, but for now we'll just consider the high-level structure and some key settings. The job section sets YARN as the execution framework (as opposed to the local job runner class) and gives the job a name. If we were to run multiple copies of this same job, we would also give each copy a unique ID. The task section specifies the implementation class of our task and also the names of the streams from which it should receive messages. Serializers tell Samza how to read and write messages to and from the stream, and the systems section defines systems by name and associates implementation classes with them.

In our case, we define only one system called kafka, and we refer to this system when sending our message in the preceding task. Note that this name is arbitrary and we could call it whatever we want. Obviously, for clarity it makes sense to call the Kafka system by the same name, but this is only a convention. In particular, sometimes you will need to give different names when dealing with multiple systems that are similar to each other, or sometimes even when treating the same system differently in different parts of a configuration file.

In this section, we will also specify the SerDe to be associated with the streams used by the task. Recall that Kafka messages have a body and an optional key that is used to determine to which partition the message is sent. Samza needs to know how to treat the contents of the keys and messages for these streams. Samza has support to treat these as raw bytes or specific types such as string, integer, and JSON, as mentioned earlier.

The rest of the configuration will be mostly unchanged from job to job, as it includes things such as the location of the ZooKeeper ensemble and Kafka clusters, and specifies how streams are to be checkpointed. Samza allows a wide variety of customizations and the full configuration options are detailed at http://samza.apache.org/learn/documentation/0.8/jobs/configuration-table.html.


Getting Twitter data into Kafka

Before we run the job, we do need to get some tweets into Kafka. Let's create a new Kafka topic called tweets to which we'll write the tweets.

To perform this and other Kafka-related operations, we'll use command-line tools located within the bin directory of the Kafka distribution. If you are running a job from within the stack created as part of the Hello Samza application, this will be deploy/kafka/bin.

kafka-topics.sh is a general-purpose tool that can be used to create, update, and describe topics. Most of its usages require arguments to specify the location of the local ZooKeeper cluster, where Kafka brokers store their details, and the name of the topic to be operated upon. To create a new topic, run the following command:

$ kafka-topics.sh --zookeeper localhost:2181 --create --topic tweets --partitions 1 --replication-factor 1

This creates a topic called tweets and explicitly sets its number of partitions and replication factor to 1. This is suitable if you are running Kafka within a local test VM, but clearly production deployments will have more partitions to scale out the load across multiple brokers and a replication factor of at least 2 to provide fault tolerance.

Use the list option of the kafka-topics.sh tool to simply show the topics in the system, or use describe to get more detailed information on specific topics:

$ kafka-topics.sh --zookeeper localhost:2181 --describe --topic tweets

Topic: tweets  PartitionCount: 1  ReplicationFactor: 1  Configs:
    Topic: tweets  Partition: 0  Leader: 0  Replicas: 0  Isr: 0

The multiple 0s are possibly confusing, as these are labels and not counts. Each broker in the system has an ID that usually starts from 0, as do the partitions within each topic. The preceding output is telling us that the topic called tweets has a single partition with ID 0, the broker acting as the leader for that partition is broker 0, and the set of in-sync replicas (ISR) for this partition is again only broker 0. This last value is particularly important when dealing with replication.

We'll use our Python utility from previous chapters to pull JSON tweets from the Twitter feed, and then use a Kafka CLI message producer to write the messages to a Kafka topic. This isn't a terribly efficient way of doing things, but it is suitable for illustration purposes. Assuming our Python script is in our home directory, run the following command from within the Kafka bin directory:

$ python ~/stream.py -j | ./kafka-console-producer.sh --broker-list localhost:9092 --topic tweets

This will run indefinitely, so be careful not to leave it running overnight on a test VM with small disk space; not that the authors have ever done such a thing.


Running a Samza job

To run a Samza job, we need our code to be packaged along with the Samza components required to execute it into a .tar.gz archive that will be read by the YARN NodeManager. This is the file referred to by the yarn.package.path property in the Samza task configuration file.

When using the single-node Hello Samza, we can just use an absolute path on the filesystem, as seen in the previous configuration example. For jobs on larger YARN grids, the easiest way is to put the package onto HDFS and refer to it by an hdfs:// URI, or onto a web server (Samza provides a mechanism to allow YARN to read the file via HTTP).

Because Samza has multiple subcomponents and each subcomponent has its own dependencies, the full YARN package can end up containing a lot of JAR files (over 100!). In addition, you need to include your custom code for the Samza task as well as some scripts from within the Samza distribution. It's not something to be done by hand. In the sample code for this chapter, found at https://github.com/learninghadoop2/book-examples/tree/master/ch4, we have set up a sample structure to hold the code and config files and provided some automation via Gradle to build the necessary task archive and start the tasks.

When in the root of the Samza example code directory for this book, run the following command to build a single file archive containing all the classes of this chapter compiled together and bundled with all the other required files:

$ ./gradlew targz

This Gradle task will not only create the necessary .tar.gz archive in the build/distributions directory, but will also store an expanded version of the archive under build/samza-package. This will be useful, as we will use Samza scripts stored in the bin directory of the archive to actually submit the task to YARN.

So now, let's run our job. We need to have file paths for two things: the Samza run-job.sh script to submit a job to YARN, and the configuration file for our job. Since our created job package has all the compiled tasks bundled together, it is by using a different configuration file, which specifies a specific task implementation class in the task.class property, that we tell Samza which task to run. To actually run the task, we can run the following command from within the exploded project archive under build/samza-package:

$ bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=config/twitter-parser.properties

For convenience, we added a Gradle task to run this job:

$ ./gradlew runTwitterParser

To see the output of the job, we'll use the Kafka CLI client to consume messages:


$ ./kafka-console-consumer.sh --zookeeper localhost:2181 --topic tweets-parsed

You should see a continuous stream of tweets appearing on the client.

Note: We did not explicitly create the topic called tweets-parsed. Kafka can allow topics to be created dynamically when either a producer or consumer tries to use the topic. In many situations, though, the default partitioning and replication values may not be suitable, and explicit topic creation will be required to ensure these critical topic attributes are correctly defined.


Samza and HDFS

You may have noticed that we just mentioned HDFS for the first time in our discussion of Samza. Though Samza integrates tightly with YARN, it has no direct integration with HDFS. At a logical level, Samza's stream-implementing systems (such as Kafka) provide the storage layer that is usually provided by HDFS for traditional Hadoop workloads. In the terminology of Samza's architecture, as described earlier, YARN is the execution layer in both models; whereas Samza uses a streaming layer for its source and destination data, frameworks such as MapReduce use HDFS. This is a good example of how YARN enables alternative computational models that not only process data very differently than batch-oriented MapReduce, but that can also use entirely different storage systems for their source data.


Windowing functions

It's frequently useful to generate some data based on the messages received on a stream over a certain time window. An example of this may be to record the top n attribute values measured every minute. Samza supports this through the WindowableTask interface, which has the following single method to be implemented:

public void window(MessageCollector collector, TaskCoordinator coordinator);

This should look similar to the process method in the StreamTask interface. However, because the method is called on a time schedule, its invocation is not associated with a received message. The MessageCollector and TaskCoordinator parameters are still there, however, as most windowable tasks will produce output messages and may also wish to perform some task management actions.

Let's take our previous task and add a window function that will output the number of tweets received in each windowed time period. This is the main class implementation of TwitterStatisticsStreamTask.java found at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterStatisticsStreamTask.java:

public class TwitterStatisticsStreamTask implements StreamTask, WindowableTask {
    private int tweets = 0;

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        tweets++;
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "tweet-stats"), "" + tweets));

        // Reset counts after windowing.
        tweets = 0;
    }
}

The TwitterStatisticsStreamTask class has a private member variable called tweets that is initialized to 0 and is incremented in every call to the process method. We therefore know that this variable will be incremented for each message passed to the task from the underlying stream implementation. Each Samza container has a single thread running in a loop that executes the process and window methods on all the tasks within the container. This means that we do not need to guard instance variables against concurrent modifications; only one method on any task within a container will be executing at a given time.


In our window method, we send a message to a new topic we call tweet-stats and then reset the tweets variable. This is pretty straightforward and the only missing piece is how Samza will know when to call the window method. We specify this in the configuration file:

task.window.ms=5000

This tells Samza to call the window method on each task instance every 5 seconds. To run the window task, there is a Gradle task:

$ ./gradlew runTwitterStatistics

If we use kafka-console-consumer.sh to listen on the tweet-stats stream now, we will see the following output:

Number of tweets: 5012
Number of tweets: 5398

Note: The term window in this context refers to Samza conceptually slicing the stream of messages into time ranges and providing a mechanism to perform processing at each range boundary. Samza does not directly provide an implementation of the other use of the term with regard to sliding windows, where a series of values is held and processed over time. However, the WindowableTask interface does provide the plumbing to implement such sliding windows.
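To make this concrete, the following is a minimal sketch of our own (not part of the book's sample code) showing how a sliding window could be layered on top of WindowableTask: each call to window pushes the count for the period that just ended onto a small queue and emits the total over the last few periods. The topic name tweet-stats-sliding and the depth of six periods are arbitrary choices for illustration.

public class SlidingTweetCountStreamTask implements StreamTask, WindowableTask {
    // Assumption: with task.window.ms=5000, keeping six periods gives a 30-second sliding window
    private static final int WINDOWS_TO_KEEP = 6;
    private final java.util.Deque<Integer> recentCounts = new java.util.ArrayDeque<Integer>();
    private int currentCount = 0;

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        currentCount++;
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        // Record the count for the period that just ended and discard the oldest period
        recentCounts.addLast(currentCount);
        currentCount = 0;
        if (recentCounts.size() > WINDOWS_TO_KEEP) {
            recentCounts.removeFirst();
        }

        int total = 0;
        for (int count : recentCounts) {
            total += count;
        }
        collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "tweet-stats-sliding"), "" + total));
    }
}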


Multijob workflows

As we saw with the Hello Samza examples, some of the real power of Samza comes from the composition of multiple jobs, and we'll use a text cleanup job to start demonstrating this capability.

In the following section, we'll perform tweet sentiment analysis by comparing tweets with a set of English positive and negative words. Simply applying this to the raw Twitter feed will have very patchy results, however, given how richly multilingual the Twitter stream is. We also need to consider things such as text cleanup, capitalization, frequent contractions, and so on. As anyone who has worked with any non-trivial dataset knows, the act of making the data fit for processing is usually where a large amount of effort (often the majority!) goes.

So before we try and detect tweet sentiments, let's do some simple text cleanup; in particular, we'll select only English-language tweets and we will force their text to be lowercase before sending them to a new output stream.

Language detection is a difficult problem and for this we'll use a feature of the Apache Tika library (http://tika.apache.org). Tika provides a wide array of functionality to extract text from various sources and then to extract further information from that text. If using our Gradle scripts, the Tika dependency is already specified and will automatically be included in the generated job package. If building through another mechanism, you will need to download the Tika JAR file from the homepage and add it to your YARN job package. The following code can be found as TextCleanupStreamTask.java at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TextCleanupStreamTask.java:

public class TextCleanupStreamTask implements StreamTask {
    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        String rawtext = ((String) envelope.getMessage());

        // Only pass on tweets detected as English, forced to lowercase
        if ("en".equals(detectLanguage(rawtext))) {
            collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "english-tweets"), rawtext.toLowerCase()));
        }
    }

    private String detectLanguage(String text) {
        LanguageIdentifier li = new LanguageIdentifier(text);
        return li.getLanguage();
    }
}

This task is quite straightforward thanks to the heavy lifting performed by Tika. We create a utility method that wraps the creation and use of a Tika LanguageIdentifier, and then we call this method on the message body of each incoming message in the process method. We only write to the output stream if the result of applying this utility method is "en", that is, the two-letter code for English.

The configuration file for this task is similar to that of our previous task, with the specific values for the task name and implementing class. It is in the repository as textcleanup.properties at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/resources/textcleanup.properties. We also need to specify the input stream:

task.inputs=kafka.tweets-parsed

This is important because we need this task to process the tweet text extracted by the earlier task, avoiding duplication of the JSON parsing logic that is best encapsulated in one place. We can run this task with the following command:

$ ./gradlew runTextCleanup

Now, we can run all three tasks together; TwitterParseStreamTask and TwitterStatisticsStreamTask will consume the raw tweet stream, while TextCleanupStreamTask will consume the output from TwitterParseStreamTask.

Data processing on streams


Tweet sentiment analysis

We'll now implement a task to perform tweet sentiment analysis similar to what we did using MapReduce in the previous chapter. This will also show us a useful mechanism offered by Samza: bootstrap streams.

Bootstrap streams

Generally speaking, most stream-processing jobs (in Samza or another framework) will start processing messages that arrive after they start up and will generally ignore historical messages. Because of its concept of replayable streams, Samza doesn't have this limitation.

In our sentiment analysis job, we had two sets of reference terms: positive and negative words. Though we've not shown it so far, Samza can consume messages from multiple streams, and the underlying machinery will poll all named streams and provide their messages, one at a time, to the process method. We can therefore create streams for the positive and negative words and push the datasets onto those streams. At first glance, we could plan to rewind these two streams to the earliest point and read tweets as they arrive. The problem is that Samza won't guarantee ordering of messages from multiple streams, and even though there is a mechanism to give streams higher priority, we can't assume that all negative and positive words will be processed before the first tweet arrives.

For such types of scenarios, Samza has the concept of bootstrap streams. If a task has any bootstrap streams defined, then it will read these streams from the earliest offset until they are fully processed (technically, it will read the streams till they get caught up, so that any new words sent to either stream will be treated without priority and will arrive interleaved between tweets).

We'll now create a new job called TwitterSentimentStreamTask that reads two bootstrap streams, collects their contents into sets, gathers running counts for sentiment trends, and uses a window function to output this data at intervals. This code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterSentimentStreamTask.java:

public class TwitterSentimentStreamTask implements StreamTask, WindowableTask {
    private Set<String> positiveWords = new HashSet<String>();
    private Set<String> negativeWords = new HashSet<String>();
    private int tweets = 0;
    private int positiveTweets = 0;
    private int negativeTweets = 0;
    private int maxPositive = 0;
    private int maxNegative = 0;

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        if ("positive-words".equals(envelope.getSystemStreamPartition().getStream())) {
            positiveWords.add(((String) envelope.getMessage()));
        } else if ("negative-words".equals(envelope.getSystemStreamPartition().getStream())) {
            negativeWords.add(((String) envelope.getMessage()));
        } else if ("english-tweets".equals(envelope.getSystemStreamPartition().getStream())) {
            tweets++;

            int positive = 0;
            int negative = 0;
            String words = ((String) envelope.getMessage());

            for (String word : words.split(" ")) {
                if (positiveWords.contains(word)) {
                    positive++;
                } else if (negativeWords.contains(word)) {
                    negative++;
                }
            }

            if (positive > negative) {
                positiveTweets++;
            }
            if (negative > positive) {
                negativeTweets++;
            }
            if (positive > maxPositive) {
                maxPositive = positive;
            }
            if (negative > maxNegative) {
                maxNegative = negative;
            }
        }
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        String msg = String.format("Tweets: %d Positive: %d Negative: %d MaxPositive: %d MaxNegative: %d",
            tweets, positiveTweets, negativeTweets, maxPositive, maxNegative);

        collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "tweet-sentiment-stats"), msg));

        // Reset counts after windowing.
        tweets = 0;
        positiveTweets = 0;
        negativeTweets = 0;
        maxPositive = 0;
        maxNegative = 0;
    }
}

In this task, we add a number of private member variables that we will use to keep a running count of the number of overall tweets, how many were positive and negative, and the maximum positive and negative counts seen in a single tweet.

This task consumes from three Kafka topics. Even though we will configure two to be used as bootstrap streams, they are all still exactly the same type of Kafka topic from which messages are received; the only difference with bootstrap streams is that we tell Samza to use Kafka's rewinding capabilities to fully re-read each message in the stream. For the other stream of tweets, we just start reading new messages as they arrive.

As hinted earlier, if a task subscribes to multiple streams, the same process method will receive messages from each stream. That is why we use envelope.getSystemStreamPartition().getStream() to extract the stream name for each given message and then act accordingly. If the message is from either of the bootstrapped streams, we add its contents to the appropriate set. We break a tweet message into its constituent words, test each word for positive or negative sentiment, and then update the counts accordingly. As you can see, this task doesn't output the received tweets to another topic.

Since we don't perform any direct processing of the tweets themselves, there is no point in doing so; any other task that wishes to consume messages can just subscribe directly to the incoming tweets stream. However, a possible modification could be to write positive and negative sentiment tweets to dedicated streams for each, as sketched below.
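As a sketch of that modification (our own illustration rather than code from the book's repository), the end of the english-tweets branch in the process method could forward each classified tweet to a dedicated topic; the topic names positive-tweets and negative-tweets are hypothetical:

// Inside the "english-tweets" branch of process(), after positive and negative
// have been counted for the current tweet; topic names are illustrative only.
if (positive > negative) {
    collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "positive-tweets"), words));
} else if (negative > positive) {
    collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "negative-tweets"), words));
}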

The window method outputs a series of counts and then resets the variables (as it did before). Note that Samza does have support to directly expose metrics through JMX, which could possibly be a better fit for such simple windowing examples. However, we won't have space to cover that aspect of the project in this book.

To run this job, we need to modify the configuration file by setting the job and task names as usual, but we also need to specify multiple input streams now:

task.inputs=kafka.english-tweets,kafka.positive-words,kafka.negative-words

Then, we need to specify that two of our streams are bootstrap streams that should be read from the earliest offset. Specifically, we set three properties for the streams. We say they are to be bootstrapped, that is, fully read before other streams, and this is achieved by specifying that the offset on each stream needs to be reset to the oldest (first) position:

systems.kafka.streams.positive-words.samza.bootstrap=true

systems.kafka.streams.positive-words.samza.reset.offset=true

systems.kafka.streams.positive-words.samza.offset.default=oldest

systems.kafka.streams.negative-words.samza.bootstrap=true

systems.kafka.streams.negative-words.samza.reset.offset=true

systems.kafka.streams.negative-words.samza.offset.default=oldest

We can run this job with the following command:


$ ./gradlew runTwitterSentiment

After starting the job, look at the output of the messages on the tweet-sentiment-stats topic.

The sentiment detection job will bootstrap the positive and negative word streams before reading any of our newly detected lowercase English tweets.

With the sentiment detection job, we can now visualize our four collaborating jobs as shown in the following diagram:

Bootstrap streams and collaborating tasks

Tip: To correctly run the jobs, it may seem necessary to start the JSON parser job followed by the cleanup job before finally starting the sentiment job, but this is not the case. Any unread messages remain buffered in Kafka, so it doesn't matter in which order the jobs of a multi-job workflow are started. Of course, the sentiment job will output counts of 0 tweets until it starts receiving data, but nothing will break if a stream job starts before those it depends on.


Stateful tasks

The final aspect of Samza that we will explore is how it allows the tasks processing stream partitions to have persistent local state. In the previous example, we used private variables to keep track of running totals, but sometimes it is useful for a task to have richer local state. An example could be the act of performing a logical join on two streams, where it is useful to build up a state model from one stream and compare this with the other.

Note: Samza can utilize its concept of partitioned streams to greatly optimize the act of joining streams. If each stream to be joined uses the same partition key (for example, a user ID), then each task consuming these streams will receive all messages associated with each ID across all the streams.
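To make the idea concrete, here is a minimal sketch of our own (not from the book's examples) of a task joining two hypothetical streams, profile-updates and click-events, that are co-partitioned by user ID; for brevity it holds state in an in-memory map rather than the key-value store described next:

public class UserClickJoinStreamTask implements StreamTask {
    // Latest profile seen for each user ID; in-memory only for this sketch
    private final java.util.Map<String, String> profilesByUserId = new java.util.HashMap<String, String>();

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        String stream = envelope.getSystemStreamPartition().getStream();
        String userId = (String) envelope.getKey();
        String body = (String) envelope.getMessage();

        if ("profile-updates".equals(stream)) {
            // Remember the most recent profile for this user
            profilesByUserId.put(userId, body);
        } else if ("click-events".equals(stream)) {
            // Both streams are partitioned by user ID, so any profile for this user
            // will have been delivered to this same task instance
            String profile = profilesByUserId.get(userId);
            if (profile != null) {
                collector.send(new OutgoingMessageEnvelope(
                    new SystemStream("kafka", "clicks-with-profile"), profile + "\t" + body));
            }
        }
    }
}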

Samza has another abstraction similar to its pluggable notions of the framework that manages its jobs and the systems that implement its streams: it defines an abstract key-value store that can have multiple concrete implementations. Samza uses existing open source projects for the on-disk implementations; it used LevelDB as of v0.7 and added RocksDB as of v0.8. There is also an in-memory store that does not persist the key-value data, but that may be useful in testing or potentially in very specific production workloads.

Each task can write to this key-value store and Samza manages its persistence to the local implementation. To support persistent state, the store is also modeled as a stream and all writes to the store are also pushed into a stream. If a task fails, then on restart it can recover the state of its local key-value store by replaying the messages in the backing topic. An obvious concern here will be the number of messages that need to be replayed; however, when using Kafka, for example, it compacts messages with the same key so that only the latest update remains in the topic.

We'll modify our previous tweet sentiment example to add a lifetime count of the maximum positive and negative sentiment seen in any tweet. The following code can be found as TwitterStatefulSentimentStreamTask.java at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/java/com/learninghadoop2/samza/tasks/TwitterStatefulSentimentStreamTask.java. Note that the process method is the same as in TwitterSentimentStreamTask.java, so we have omitted it here for space reasons:

public class TwitterStatefulSentimentStreamTask implements StreamTask, WindowableTask, InitableTask {
    private Set<String> positiveWords = new HashSet<String>();
    private Set<String> negativeWords = new HashSet<String>();
    private int tweets = 0;
    private int positiveTweets = 0;
    private int negativeTweets = 0;
    private int maxPositive = 0;
    private int maxNegative = 0;
    private KeyValueStore<String, Integer> store;

    @SuppressWarnings("unchecked")
    @Override
    public void init(Config config, TaskContext context) {
        this.store = (KeyValueStore<String, Integer>) context.getStore("tweet-store");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        ...
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) {
        Integer lifetimeMaxPositive = store.get("lifetimeMaxPositive");
        Integer lifetimeMaxNegative = store.get("lifetimeMaxNegative");

        if ((lifetimeMaxPositive == null) || (maxPositive > lifetimeMaxPositive)) {
            lifetimeMaxPositive = maxPositive;
            store.put("lifetimeMaxPositive", lifetimeMaxPositive);
        }

        if ((lifetimeMaxNegative == null) || (maxNegative > lifetimeMaxNegative)) {
            lifetimeMaxNegative = maxNegative;
            store.put("lifetimeMaxNegative", lifetimeMaxNegative);
        }

        String msg = String.format(
            "Tweets: %d Positive: %d Negative: %d MaxPositive: %d MaxNegative: %d LifetimeMaxPositive: %d LifetimeMaxNegative: %d",
            tweets, positiveTweets, negativeTweets, maxPositive, maxNegative, lifetimeMaxPositive, lifetimeMaxNegative);

        collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "tweet-stateful-sentiment-stats"), msg));

        // Reset counts after windowing.
        tweets = 0;
        positiveTweets = 0;
        negativeTweets = 0;
        maxPositive = 0;
        maxNegative = 0;
    }
}

This class implements a new interface called InitableTask. This has a single method called init and is used when a task needs to set up aspects of its state or configuration before it begins execution. We use the init() method here to obtain an instance of the KeyValueStore class and store it in a private member variable.


KeyValueStore, as the name suggests, provides a familiar put/get type interface. In this case, we specify that the keys are of the type String and the values are Integers. In our window method, we retrieve any previously stored values for the maximum positive and negative sentiment and, if the count in the current window is higher, update the store accordingly. Then, we just output the results of the window method as before.

As you can see, the user does not need to deal with the details of either the local or remote persistence of the KeyValueStore instance; this is all handled by Samza. The efficiency of the mechanism also makes it tractable for tasks to hold sizeable amounts of local state, which can be particularly valuable in cases such as long-running aggregations or stream joins.

The configuration file for the job can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch4/src/main/resources/twitter-stateful-sentiment.properties. It needs to have a few entries added, which are as follows:

stores.tweet-store.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory
stores.tweet-store.changelog=kafka.twitter-stats-state
stores.tweet-store.key.serde=string
stores.tweet-store.msg.serde=integer

The first line specifies the implementation class for the store, the second line specifies the Kafka topic to be used for persistent state, and the last two lines specify the types of the store key and value.

To run this job, use the following command:

$ ./gradlew runTwitterStatefulSentiment

For convenience, the following command will start up four jobs: the JSON parser, the text cleanup, the statistics job, and the stateful sentiment job:

$ ./gradlew runTasks

Samza is a pure stream-processing system that provides pluggable implementations of its storage and execution layers. The most commonly used plugins are YARN and Kafka, and these demonstrate how Samza can integrate tightly with Hadoop YARN while using a completely different storage layer. Samza is still a relatively new project and the current features are only a subset of what is envisaged. It is recommended to consult its web page to get the latest information on its current status.


Summary

This chapter focused much more on what can be done on Hadoop 2, and in particular YARN, than on the details of Hadoop internals. This is almost certainly a good thing, as it demonstrates that Hadoop is realizing its goal of becoming a much more flexible and generic data processing platform that is no longer tied to batch processing. In particular, we highlighted how Samza shows that the processing frameworks that can be implemented on YARN can innovate and enable functionality vastly different from that available in Hadoop 1.

We saw how Samza goes to the opposite end of the latency spectrum from batch processing and enables processing of individual messages as they arrive.

We also saw how Samza provides a callback mechanism that MapReduce developers will be familiar with, but uses it for a very different processing model. We also discussed the ways in which Samza utilizes YARN as its main execution framework and how it implements the model described in Chapter 3, Processing – MapReduce and Beyond.

In the next chapter, we will switch gears and explore Apache Spark. Though it has a very different data model than Samza, we'll see that it also has an extension that supports processing of real-time data streams, including the option of Kafka integration. However, both projects are so different that they are complementary rather than in competition.


Chapter 5. Iterative Computation with Spark

In the previous chapter, we saw how Samza can enable near real-time stream data processing within Hadoop. This is quite a step away from the traditional batch processing model of MapReduce, but still keeps with the model of providing a well-defined interface against which business logic tasks can be implemented. In this chapter we will explore Apache Spark, which can be viewed both as a framework on which applications can be built as well as a processing framework in its own right. Not only are applications being built on Spark, but entire components within the Hadoop ecosystem are also being reimplemented to use Spark as their underlying processing framework. In particular, we will cover the following topics:

What Spark is and how its core system can run on YARN
The data model provided by Spark that enables hugely scalable and highly efficient data processing
The breadth of additional Spark components and related projects

It's important to note up front that although Spark has its own mechanism to process streaming data, this is but one part of what Spark has to offer. It's best to think of it as a much broader initiative.


Apache Spark

Apache Spark (https://spark.apache.org/) is a data processing framework based on a generalization of MapReduce. It was originally developed by the AMPLab at UC Berkeley (https://amplab.cs.berkeley.edu/). Like Tez, Spark acts as an execution engine that models data transformations as DAGs and strives to eliminate the I/O overhead of MapReduce in order to perform iterative computation at scale. While Tez's main goal was to provide a faster execution engine for MapReduce on Hadoop, Spark has been designed both as a standalone framework and an API for application development. The system is designed to perform general-purpose in-memory data processing, stream workflows, as well as interactive and iterative computation.

Spark is implemented in Scala, a statically typed programming language for the Java VM, and exposes native programming interfaces for Java and Python in addition to Scala itself. Note that though Java code can call the Scala interface directly, there are some aspects of the type system that make such code pretty unwieldy, and hence we use the native Java API.

Scala ships with an interactive shell similar to that of Ruby and Python; this allows users to run Spark interactively from the interpreter to query any dataset.

The Scala interpreter operates by compiling a class for each line typed by the user, loading it into the JVM, and invoking a function on it. This class includes a singleton object that contains the variables or functions on that line and runs the line's code in an initialize method. In addition to its rich programming interfaces, Spark is becoming established as an execution engine, with popular tools of the Hadoop ecosystem (such as Pig and Hive) being ported to the framework.


Cluster computing with working sets

Spark's architecture is centered around the concept of Resilient Distributed Datasets (RDDs): a read-only collection of Scala objects, partitioned across a set of machines, that can persist in memory. This abstraction was proposed in a 2012 research paper, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, which can be found at https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf.

A Spark application consists of a driver program that executes parallel operations on a cluster of workers: long-lived processes that can store data partitions in memory. The driver coordinates the work by dispatching functions that run as parallel tasks, as shown in the following diagram:

Spark cluster architecture

Processes are coordinated via a SparkContext instance. SparkContext connects to a resource manager (such as YARN), requests executors on worker nodes, and sends tasks to be executed. Executors are responsible for running tasks and managing memory locally.

Spark allows you to share variables between tasks, or between tasks and the driver, using an abstraction known as shared variables. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are additive variables such as counters and sums.
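As a brief illustration of both kinds of shared variable, here is a sketch of our own using the Java API (the examples later in this chapter use Scala); it assumes an existing JavaSparkContext named sc and a JavaRDD<String> of words named words, and all variable names are hypothetical:

// Assumes imports of org.apache.spark.broadcast.Broadcast, org.apache.spark.Accumulator,
// org.apache.spark.api.java.function.Function and java.util.*
// Cache a stop-word set once per executor rather than shipping it with every task
final Broadcast<Set<String>> stopWords =
    sc.broadcast(new HashSet<String>(Arrays.asList("the", "a", "and")));

// Additive counter that tasks update and only the driver reads
final Accumulator<Integer> dropped = sc.accumulator(0);

JavaRDD<String> kept = words.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String word) {
        if (stopWords.value().contains(word)) {
            dropped.add(1);    // count how many words were discarded
            return false;
        }
        return true;
    }
});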

Resilient Distributed Datasets (RDDs)

An RDD is stored in memory, shared across machines, and is used in MapReduce-like parallel operations. Fault tolerance is achieved through the notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition. An RDD can be built in four ways:

By reading data from a file stored in HDFS
By dividing (parallelizing) a Scala collection into a number of partitions that are sent to workers
By transforming an existing RDD using parallel operators
By changing the persistence of an existing RDD

Spark shines when RDDs can fit in memory and can be cached across operations. The API exposes methods to persist RDDs and allows for several persistence strategies and storage levels, allowing for spill to disk as well as space-efficient binary serialization.
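For instance, the following one-liner (a sketch of our own using the Java API; it assumes a JavaRDD<String> named words, and the chosen storage level is just an example) caches an RDD in memory while spilling serialized partitions to disk when memory runs out:

// Assumes import org.apache.spark.storage.StorageLevel
// MEMORY_AND_DISK_SER keeps partitions serialized in memory and spills to disk if needed
words.persist(StorageLevel.MEMORY_AND_DISK_SER());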

Actions

Operations are invoked by passing functions to Spark. The system deals with variables and side effects according to the functional programming paradigm. Closures can refer to variables in the scope where they are created. Examples of actions are count (returns the number of elements in the dataset) and save (outputs the dataset to storage). Other parallel operations on RDDs include the following:

map: applies a function to each element of the dataset
filter: selects elements from a dataset based on user-provided criteria
reduce: combines dataset elements using an associative function
collect: sends all elements of the dataset to the driver program
foreach: passes each element through a user-provided function
groupByKey: groups items together by a provided key
sortByKey: sorts items by key


Deployment

Spark can run both in local mode, similar to a Hadoop single-node setup, or atop a resource manager. Currently supported resource managers include:

Spark Standalone Cluster Mode
YARN
Apache Mesos

Spark on YARN

An ad hoc consolidated JAR needs to be built in order to deploy Spark on YARN. Spark launches an instance of the standalone deployed cluster within the ResourceManager. Cloudera and MapR both ship with Spark on YARN as part of their software distributions. At the time of writing, Spark is available for Hortonworks's HDP as a technology preview (http://hortonworks.com/hadoop/spark/).

Spark on EC2

Spark comes with a deployment script, spark-ec2, located in the ec2 directory. This script automatically sets up Spark and HDFS on a cluster of EC2 instances. In order to launch a Spark cluster on the Amazon cloud, go to the ec2 directory and run the following command:

./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>

Here, <keypair> is the name of your EC2 key pair, <key-file> is the private key file for the key pair, <num-slaves> is the number of slave nodes to be launched, and <cluster-name> is the name to be given to your cluster. See Chapter 1, Introduction, for more details regarding the setup of key pairs, and verify that the cluster scheduler is up and sees all the slaves by going to its web UI, the address of which will be printed once the script completes.

You can specify a path in S3 as the input through a URI of the form s3n://<bucket>/path. You will also need to set your Amazon security credentials, either by setting the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY before your program is executed, or through SparkContext.hadoopConfiguration.
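For the second option, a short sketch using the Java API (our own illustration; these are the standard Hadoop s3n credential property names, and the values shown are placeholders):

// Assumes an existing JavaSparkContext named sc
sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");

JavaRDD<String> input = sc.textFile("s3n://<bucket>/path");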


Getting started with Spark

Spark binaries and source code are available on the project website at http://spark.apache.org/. The examples in the following section have been tested using Spark 1.1.0 built from source on the Cloudera CDH 5.0 QuickStart VM.

Download and uncompress the gzip archive with the following commands:

$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0.tgz
$ tar xvzf spark-1.1.0.tgz
$ cd spark-1.1.0

Spark is built on Scala 2.10 and uses sbt (https://github.com/sbt/sbt) to build the source core and related examples:

$ ./sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly

With the -Dhadoop.version=2.2.0 and -Pyarn options, we instruct sbt to build against Hadoop version 2.2.0 or higher and enable YARN support.

Start Spark in standalone mode with the following command:

$ ./sbin/start-all.sh

This command will launch a local master instance at spark://localhost:7077 as well as a worker node.

A web interface to the master node can be accessed at http://localhost:8080/ and can be seen in the following screenshot:

Master node web interface

Spark can run interactively through spark-shell, which is a modified version of the Scala shell. As a first example, we will implement a word count of the Twitter dataset we used in Chapter 3, Processing – MapReduce and Beyond, using the Scala API.

Start an interactive spark-shell session by running the following command:

$ ./bin/spark-shell

The shell instantiates a SparkContext object, sc, that is responsible for handling driver connections to workers. We will describe its semantics later in this chapter.


To make things a bit easier, let's create a sample textual dataset that contains one status update per line:

$ stream.py -t -n 1000 > sample.txt

Then, copy it to HDFS:

$ hdfs dfs -put sample.txt /tmp

Within spark-shell, we first create an RDD, file, from the sample data:

val file = sc.textFile("/tmp/sample.txt")

Then, we apply a series of transformations to count the word occurrences in the file. Note that the output of the transformation chain, counts, is still an RDD:

val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((m, n) => m + n)

This chain of transformations corresponds to the map and reduce phases that we are familiar with. In the map phase, we load each line of the dataset (flatMap), tokenize each tweet into a sequence of words, count the occurrence of each word (map), and emit (key, value) pairs. In the reduce phase, we group by key (word) and sum the values (m, n) together to obtain word counts.

Finally, we print the first ten elements, counts.take(10), to the console:

counts.take(10).foreach(println)


Writing and running standalone applications

Spark allows standalone applications to be written using three APIs: Scala, Java, and Python.

Scala API

The first thing a Spark driver must do is to create a SparkContext object, which tells Spark how to access a cluster. First, import the required classes and implicit conversions into the program, as in the following:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

The SparkContext object can then be created with the following constructor:

new SparkContext(master, appName, [sparkHome])

It can also be created through SparkContext(conf), which takes a SparkConf object.

The master parameter is a string that specifies a cluster URI to connect to (such as spark://localhost:7077) or the string local to run in local mode. The appName term is the application name that will be shown in the cluster web UI.

It is not possible to override the default SparkContext, nor is it possible to create a new one within a running Spark shell. It is, however, possible to specify which master the context connects to using the MASTER environment variable. For example, to run spark-shell on four cores, use the following:

$ MASTER=local[4] ./bin/spark-shell

Java API

The org.apache.spark.api.java package exposes all the Spark features available in the Scala version to Java. The Java API has a JavaSparkContext class that returns instances of org.apache.spark.api.java.JavaRDD and works with Java collections instead of Scala ones.

There are a few key differences between the Java and Scala APIs:

Java 7 does not support anonymous or first-class functions; therefore, functions must be implemented by extending the org.apache.spark.api.java.function.Function, Function2, and other classes. As of Spark version 1.0, the API has been refactored to support Java 8 lambda expressions. With Java 8, Function classes can be replaced with inline expressions that act as a shorthand for anonymous functions.
The RDD methods return Java collections.
Key-value pairs, which are simply written as (key, value) in Scala, are represented by the scala.Tuple2 class.
To maintain type safety, some RDD and function methods, such as those that handle key pairs and doubles, are implemented as specialized classes.


WordCount in Java

An example of WordCount in Java is included with the Spark source code distribution at examples/src/main/java/org/apache/spark/examples/JavaWordCount.java.

First of all, we create a context using the JavaSparkContext class:

JavaSparkContext sc = new JavaSparkContext(master, "JavaWordCount",
    System.getenv("SPARK_HOME"),
    JavaSparkContext.jarOfClass(JavaWordCount.class));

JavaRDD<String> data = sc.textFile(infile, 1);

JavaRDD<String> words = data.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
    }
});

JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
    }
});

JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
    }
});

We then build an RDD from the HDFS location infile. In the first step of the transformation chain, we tokenize each tweet in the dataset and return a list of words. We use an instance of JavaPairRDD<String, Integer> to count occurrences of each word. Finally, we reduce the RDD to a new JavaPairRDD<String, Integer> instance that contains a list of tuples, each representing a word and the number of times it was found in the dataset.
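For comparison, and as a sketch of our own rather than code taken from the Spark distribution, the same pipeline written with Java 8 lambda expressions (supported since Spark 1.0, as noted earlier) becomes much shorter:

// Same logic as above, with the anonymous Function classes replaced by lambdas
JavaRDD<String> words = data.flatMap(s -> Arrays.asList(s.split(" ")));
JavaPairRDD<String, Integer> ones = words.mapToPair(s -> new Tuple2<String, Integer>(s, 1));
JavaPairRDD<String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);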

Python API

PySpark requires Python version 2.6 or higher. RDDs support the same methods as their Scala counterparts but take Python functions and return Python collection types. Lambda syntax (https://docs.python.org/2/reference/expressions.html) is used to pass functions to RDDs.

The word count in pyspark is relatively similar to its Scala counterpart:

tweets = sc.textFile("/tmp/sample.txt")

counts = tweets.flatMap(lambda tweet: tweet.split(' ')) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda m, n: m + n)

The lambda construct creates anonymous functions at runtime. lambda tweet: tweet.split(' ') creates a function that takes a string tweet as the input and outputs a list of strings split by whitespace. Spark's flatMap applies this function to each line of the tweets dataset. In the map phase, for each word token, lambda word: (word, 1) returns (word, 1) tuples that indicate the occurrence of a word in the dataset. In reduceByKey, we group these tuples by key (word) and sum the values together to obtain the word count with lambda m, n: m + n.


The Spark ecosystem

Apache Spark powers a number of tools, both as a library and as an execution engine.


Spark Streaming

Spark Streaming (found at http://spark.apache.org/docs/latest/streaming-programming-guide.html) is an extension of the Scala API that allows data ingestion from streams such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets.

Spark Streaming receives live input data streams and divides the data into batches (arbitrarily sized time windows), which are then processed by the Spark core engine to generate the final stream of results in batches. This high-level abstraction is called a DStream (org.apache.spark.streaming.dstream.DStream) and is implemented as a sequence of RDDs. DStream allows for two kinds of operations: transformations and output operations. Transformations work on one or more DStreams to create new DStreams. As part of a chain of transformations, data can be persisted either to a storage layer (HDFS) or an output channel. Spark Streaming allows for transformations over a sliding window of data. A window-based operation needs to specify two parameters: the window length, which is the duration of the window, and the slide interval, which is the interval at which the window-based operation is performed.
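As an illustration of these two parameters, here is a sketch of our own using the Java API (the examples later in this chapter use Scala); it assumes an existing JavaPairDStream<String, Integer> of (word, 1) pairs named pairs and recomputes a word count over the last 30 seconds every 10 seconds:

// Assumes imports of org.apache.spark.streaming.Duration and
// org.apache.spark.streaming.api.java.JavaPairDStream
JavaPairDStream<String, Integer> windowedCounts = pairs.reduceByKeyAndWindow(
    (i1, i2) -> i1 + i2,          // combine counts within the window
    new Duration(30 * 1000),      // window length
    new Duration(10 * 1000));     // slide interval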


GraphX

GraphX (found at https://spark.apache.org/docs/latest/graphx-programming-guide.html) is an API for graph computation that exposes a set of operators and algorithms for graph-oriented computation as well as an optimized variant of Pregel.


MLlib

MLlib (found at http://spark.apache.org/docs/latest/mllib-guide.html) provides common Machine Learning (ML) functionality, including tests and data generators. MLlib currently supports four types of algorithms: binary classification, regression, clustering, and collaborative filtering.


Spark SQL

Spark SQL is derived from Shark, which is an implementation of the Hive data warehousing system that uses Spark as an execution engine. We will discuss Hive in Chapter 7, Hadoop and SQL. With Spark SQL, it is possible to mix SQL-like queries with Scala or Python code. The result sets returned by a query are themselves RDDs, and as such, they can be manipulated by Spark core methods or MLlib and GraphX.


Processing data with Apache Spark

In this section, we will implement the examples from Chapter 3, Processing – MapReduce and Beyond, using the Scala API. We will consider both the batch and real-time processing scenarios. We will show you how Spark Streaming can be used to compute statistics on the live Twitter stream.


Building and running the examples

Scala source code for the examples can be found at https://github.com/learninghadoop2/book-examples/tree/master/ch5. We will be using sbt to build, manage, and execute code.

The build.sbt file controls the codebase metadata and software dependencies; these include the version of the Scala interpreter that Spark links to, a link to the Akka package repository used to resolve implicit dependencies, as well as dependencies on the Spark and Hadoop libraries.

The source code for all examples can be compiled with:

$ sbt compile

Or, it can be packaged into a JAR file with:

$ sbt package

A helper script to execute compiled classes can be generated with:

$ sbt add-start-script-tasks
$ sbt start-script

The helper can be invoked as follows:

$ target/start <classname> <master> <param1> … <paramn>

Here, <master> is the URI of the master node. An interactive Scala session can be invoked via sbt with the following command:

$ sbt console

This console is not the same as the Spark interactive shell; rather, it is an alternative way to execute code. In order to run Spark code in it, we will need to manually import and instantiate a SparkContext object. All examples presented in this section expect a twitter4j.properties file containing the consumer key and secret and the access tokens to be present in the same directory where sbt or spark-shell is being invoked:

oauth.consumerKey=

oauth.consumerSecret=

oauth.accessToken=

oauth.accessTokenSecret=

Running the examples on YARN

To run the examples on a YARN grid, we first build a JAR file using:

$ sbt package

Then, we ship it to the resource manager using the spark-submit command:

./bin/spark-submit --class application.to.execute --master yarn-cluster [options] target/scala-2.10/chapter-4_2.10-1.0.jar [<param1> … <paramn>]


Unlike the standalone mode, we don't need to specify a <master> URI. In YARN, the ResourceManager is selected from the cluster configuration. More information on launching Spark in YARN can be found at http://spark.apache.org/docs/latest/running-on-yarn.html.

Finding popular topics

Unlike the earlier examples with the Spark shell, we initialize a SparkContext as part of the program. We pass three arguments to the SparkContext constructor: the type of scheduler we want to use, a name for the application, and the directory where Spark is installed:

import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import scala.util.matching.Regex

object HashtagCount {
  def main(args: Array[String]) {
    […]

    val sc = new SparkContext(master,
      "HashtagCount",
      System.getenv("SPARK_HOME"))

    val file = sc.textFile(inputFile)
    val pattern = new Regex("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")

    val counts = file.flatMap(line => (pattern findAllIn line).toList)
      .map(word => (word, 1))
      .reduceByKey((m, n) => m + n)

    counts.saveAsTextFile(outputPath)
  }
}

We create an initial RDD from a dataset stored in HDFS, inputFile, and apply logic that is similar to the WordCount example.

For each tweet in the dataset, we extract an array of strings that match the hashtag pattern, (pattern findAllIn line).toList, and we count an occurrence of each string using the map operator. This generates a new RDD as a list of tuples in the form:

(word, 1), (word2, 1), (word, 1)

Finally, we combine together the elements of this RDD using the reduceByKey() method. We store the RDD generated by this last step back into HDFS with saveAsTextFile.

The code for the standalone driver can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/HashTagCount.scala.

Assigning a sentiment to topics

The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/HashTagSentiment.scala and the code is as follows:

import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import scala.util.matching.Regex
import scala.io.Source

object HashtagSentiment {
  def main(args: Array[String]) {
    […]

    val sc = new SparkContext(master,
      "HashtagSentiment",
      System.getenv("SPARK_HOME"))

    val file = sc.textFile(inputFile)

    val positive = Source.fromFile(positiveWordsPath)
      .getLines
      .filterNot(_ startsWith ";")
      .toSet

    val negative = Source.fromFile(negativeWordsPath)
      .getLines
      .filterNot(_ startsWith ";")
      .toSet

    val pattern = new Regex("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")

    val counts = file.flatMap(line => (pattern findAllIn line).map({
      word => (word, sentimentScore(line, positive, negative))
    })).reduceByKey({ (m, n) => (m._1 + n._1, m._2 + n._2) })

    val sentiment = counts.map({ hashtagScore =>
      val hashtag = hashtagScore._1
      val score = hashtagScore._2
      val normalizedScore = score._1 / score._2
      (hashtag, normalizedScore)
    })

    sentiment.saveAsTextFile(outputPath)
  }
}

First, we read a list of positive and negative words into Scala Set objects and filter out comments (strings beginning with ;).

When a hashtag is found, we call a function, sentimentScore, to estimate the sentiment expressed by the given text. This function implements the same logic we used in Chapter 3, Processing – MapReduce and Beyond, to estimate the sentiment of a tweet. It takes as input parameters the tweet's text, str, and the lists of positive and negative words as Set[String] objects. The return value is a pair made of the difference between the positive and negative scores and the number of words in the tweet. In Spark, we represent this return value as a pair of Double and Integer objects:


def sentimentScore(str: String, positive: Set[String],
    negative: Set[String]): (Double, Int) = {
  var positiveScore = 0; var negativeScore = 0;

  str.split("""\s+""").foreach { w =>
    if (positive.contains(w)) { positiveScore += 1; }
    if (negative.contains(w)) { negativeScore += 1; }
  }

  ((positiveScore - negativeScore).toDouble,
    str.split("""\s+""").length)
}

We reduce the map output by aggregating by the key (the hashtag). In this phase, we emit a triple made of the hashtag, the sum of the differences between positive and negative scores, and the number of words per tweet. We use an additional map step to normalize the sentiment score and store the resulting list of hashtag and sentiment pairs to HDFS.


Data processing on streams

The previous example can be easily adjusted to work on a real-time stream of data. In this and the following section, we will use spark-streaming-twitter to perform some simple analytics tasks on the real-time firehose:

val window = 10
val ssc = new StreamingContext(master, "TwitterStreamEcho",
  Seconds(window), System.getenv("SPARK_HOME"))

val stream = TwitterUtils.createStream(ssc, auth)
val tweets = stream.map(tweet => (tweet.getText()))
tweets.print()

ssc.start()
ssc.awaitTermination()
}

The Scala source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/TwitterStreamEcho.scala

The two key packages we need to import are:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter._

We initialize a new StreamingContext ssc on a local cluster using a 10-second window and use this context to create a DStream of tweets whose text we print.

Upon successful execution, Twitter's real-time firehose will be echoed in the terminal in batches of 10 seconds' worth of data. Notice that the computation will continue indefinitely but can be interrupted at any moment by pressing Ctrl + C.

The TwitterUtils object is a wrapper to the Twitter4j library (http://twitter4j.org/en/index.html) that ships with spark-streaming-twitter. A successful call to TwitterUtils.createStream will return a DStream of Twitter4j objects (TwitterInputDStream). In the preceding example, we used the getText() method to extract the tweet text; however, notice that the twitter4j object exposes the full Twitter API. For instance, we can print a stream of users with the following call:

val users = stream.map(tweet => (tweet.getUser().getId(),
  tweet.getUser().getName()))
users.print()

State management

Spark Streaming provides an ad hoc DStream to keep the state of each key in an RDD and the updateStateByKey method to mutate state.

We can reuse the code of the batch example to assign and update sentiment scores on streams:


object StreamingHashTagSentiment {
  […]
  val counts = text.flatMap(line => (pattern findAllIn line)
      .toList
      .map(word => (word, sentimentScore(line, positive, negative))))
    .reduceByKey({ (m, n) => (m._1 + n._1, m._2 + n._2) })

  val sentiment = counts.map({ hashtagScore =>
    val hashtag = hashtagScore._1
    val score = hashtagScore._2
    val normalizedScore = score._1 / score._2
    (hashtag, normalizedScore)
  })

  val stateDstream = sentiment
    .updateStateByKey[Double](updateFunc)

  stateDstream.print

  ssc.checkpoint("/tmp/checkpoint")
  ssc.start()
}

A state DStream is created by calling updateStateByKey on the sentiment DStream.

The updateFunc function implements the state mutation logic, which is a cumulative sum of sentiment scores over a period of time:

val updateFunc = (values: Seq[Double], state: Option[Double]) => {
  val currentScore = values.sum
  val previousScore = state.getOrElse(0.0)
  Some((currentScore + previousScore) * decayFactor)
}

decayFactor is a constant value, less than or equal to one, that we use to proportionally decrease the score over time. Intuitively, this will fade hashtags if they are not trending anymore. Spark Streaming writes intermediate data for stateful operations to HDFS, so we need to checkpoint the Streaming context with ssc.checkpoint.

The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/StreamingHashTagSentiment.scala


Data analysis with Spark SQL

Spark SQL can ease the task of representing and manipulating structured data. We will load a JSON file into a temporary table and calculate simple statistics by blending SQL statements and Scala code:

object SparkJson {
  […]
  val file = sc.textFile(inputFile)

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext._

  val tweets = sqlContext.jsonFile(inFile)
  tweets.printSchema()

  // Register the SchemaRDD as a table
  tweets.registerTempTable("tweets")

  val text = sqlContext.sql("SELECT text, user.id FROM tweets")

  // Find the ten most popular hashtags
  val pattern = new Regex("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")

  val counts = text.flatMap(sqlRow => (pattern findAllIn
      sqlRow(0).toString).toList)
    .map(word => (word, 1))
    .reduceByKey((m, n) => m + n)

  counts.registerTempTable("hashtag_frequency")
  counts.printSchema

  val top10 = sqlContext.sql("SELECT _1 as hashtag, _2 as frequency FROM hashtag_frequency order by frequency desc limit 10")

  top10.foreach(println)
}

As with previous examples, we instantiate a SparkContext sc and load the dataset of JSON tweets. We then create an instance of org.apache.spark.sql.SQLContext based on the existing sc. The import sqlContext._ gives access to all functions and implicit conversions for sqlContext. We load the tweets' JSON dataset using sqlContext.jsonFile. The resulting tweets object is an instance of SchemaRDD, which is a new type of RDD introduced by Spark SQL. The SchemaRDD class is conceptually similar to a table in a relational database; it is composed of Row objects and a schema that describes the content of each Row. We can see the schema for a tweet by calling tweets.printSchema(). Before we're able to manipulate tweets with SQL statements, we need to register the SchemaRDD as a table in the SQLContext. We then extract the text field of a JSON tweet with an SQL query. Note that the output of sqlContext.sql is an RDD again. As such, we can manipulate it using Spark core methods. In our case, we reuse the logic used in previous examples to extract hashtags and count their occurrences. Finally, we register the resulting RDD as a table, hashtag_frequency, and order hashtags by frequency with a SQL query.

The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/SparkJson.scala.

SQL on data streams

At the time of writing, a SQLContext cannot be directly instantiated from a StreamingContext object. It is, however, possible to query a DStream by registering a SchemaRDD for each RDD in a given stream:

object SqlOnStream {
  […]
  val ssc = new StreamingContext(sc, Seconds(window))

  val gson = new Gson()
  val dstream = TwitterUtils
    .createStream(ssc, auth)
    .map(gson.toJson(_))

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext._

  dstream.foreachRDD(rdd => {
    rdd.foreach(println)
    val jsonRDD = sqlContext.jsonRDD(rdd)
    jsonRDD.registerTempTable("tweets")
    jsonRDD.printSchema
    sqlContext.sql(query)
  })

  ssc.checkpoint("/tmp/checkpoint")
  ssc.start()
  ssc.awaitTermination()
}

In order to get the two working together, we first create a SparkContext sc that we use to initialize both a StreamingContext ssc and a sqlContext. As in previous examples, we use TwitterUtils.createStream to create a DStream RDD dstream. In this example, we use Google's Gson JSON parser to serialize each twitter4j object to a JSON string. To execute Spark SQL queries on the stream, we register a SchemaRDD jsonRDD within a dstream.foreachRDD loop. We use the sqlContext.jsonRDD method to create an RDD from a batch of JSON tweets. At this point, we can query the SchemaRDD using the sqlContext.sql method.

The source code of this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch5/src/main/scala/com/learninghadoop2/spark/SqlOnStream.scala.


Comparing Samza and Spark Streaming

It is useful to compare Samza and Spark Streaming to help identify the areas in which each can best be applied. As has hopefully been made clear in this book, these technologies are very much complementary. Even though Spark Streaming might appear competitive with Samza, we feel both products offer compelling advantages in certain areas.

Samza shines when the input data is truly a stream of discrete events and you wish to build processing that operates on this type of input. Samza jobs running on Kafka can have latencies in the order of milliseconds. This provides a programming model focused on the individual messages and is the better fit for true near real-time processing applications. Though it lacks support to build topologies of collaborating jobs, its simple model allows similar constructs to be built and, perhaps more importantly, be easily reasoned about. Its model of partitioning and scaling also focuses on simplicity, which again makes a Samza application very easy to understand and gives it a significant advantage when dealing with something as intrinsically complex as real-time data.

Spark is much more than a streaming product. Its support for building distributed data structures from existing datasets and using powerful primitives to manipulate these gives it the ability to process large datasets at a higher level of granularity. Other products in the Spark ecosystem build additional interfaces or abstractions upon this common batch processing core. This is very much a different focus to the message stream model of Samza.

This batch model is also demonstrated when we look at Spark Streaming; instead of a per-message processing model, it slices the message stream into a series of RDDs. With a fast execution engine, this means latencies as low as 1 second (http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf). For workloads that wish to analyze the stream in such a way, this will be a better fit than Samza's per-message model, which requires additional logic to provide such windowing.


Summary

This chapter explored Spark and showed you how it adds iterative processing as a new rich framework upon which applications can be built atop YARN. In particular, we highlighted:

The distributed data-structure-based processing model of Spark and how it allows very efficient in-memory data processing
The broader Spark ecosystem and how multiple additional projects are built atop it to specialize the computational model even further

In the next chapter we will explore Apache Pig and its programming language, Pig Latin. We will see how this tool can greatly simplify software development for Hadoop by abstracting away some of the MapReduce and Spark complexity.


Chapter 6. Data Analysis with Apache Pig

In the previous chapters, we explored a number of APIs for data processing. MapReduce, Spark, Tez and Samza are rather low-level, and writing non-trivial business logic with them often requires significant Java development. Moreover, different users will have different needs. It might be impractical for an analyst to write MapReduce code or build a DAG of inputs and outputs to answer some simple queries. At the same time, a software engineer or a researcher might want to prototype ideas and algorithms using high-level abstractions before jumping into low-level implementation details.

In this chapter and the following one, we will explore some tools that provide a way to process data on HDFS using higher-level abstractions. In this chapter we will explore Apache Pig, and, in particular, we will cover the following topics:

What Apache Pig is and the dataflow model it provides
Pig Latin's data types and functions
How Pig can be easily enhanced using custom user code
How we can use Pig to analyze the Twitter stream


An overview of Pig

Historically, the Pig toolkit consisted of a compiler that generated MapReduce programs, bundled their dependencies, and executed them on Hadoop. Pig jobs are written in a language called Pig Latin and can be executed in both interactive and batch fashions. Furthermore, Pig Latin can be extended using User Defined Functions (UDFs) written in Java, Python, Ruby, Groovy, or JavaScript.

Pig use cases include the following:

Data processing
Ad hoc analytical queries
Rapid prototyping of algorithms
Extract Transform Load pipelines

Following a trend we have seen in previous chapters, Pig is moving towards a general-purpose computing architecture. As of version 0.13, the ExecutionEngine interface (org.apache.pig.backend.executionengine) acts as a bridge between the frontend and the backend of Pig, allowing Pig Latin scripts to be compiled and executed on frameworks other than MapReduce. At the time of writing, version 0.13 ships with MRExecutionEngine (org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRExecutionEngine), and work on a low-latency backend based on Tez (org.apache.pig.backend.hadoop.executionengine.tez.*) is expected to be included in version 0.14 (see https://issues.apache.org/jira/browse/PIG-3446). Work on integrating Spark is currently in progress in the development branch (see https://issues.apache.org/jira/browse/PIG-4059).

Pig 0.13 comes with a number of performance enhancements for the MapReduce backend, in particular two features to reduce the latency of small jobs: direct HDFS access (https://issues.apache.org/jira/browse/PIG-3642) and auto local mode (https://issues.apache.org/jira/browse/PIG-3463). Direct HDFS access, controlled by the opt.fetch property, is turned on by default. When doing a DUMP in a simple (map-only) script that contains only LIMIT, FILTER, UNION, STREAM, or FOREACH operators, input data is fetched from HDFS and the query is executed directly in Pig, bypassing MapReduce. With auto local mode, controlled by the pig.auto.local.enabled property, Pig will run a query in Hadoop local mode when the data size is smaller than pig.auto.local.input.maxbytes. Auto local mode is off by default.

Pig will launch MapReduce jobs if both modes are off or if the query is not eligible for either. If both modes are on, Pig will check whether the query is eligible for direct access and, if not, fall back to auto local mode. Failing that, it will execute the query on MapReduce.
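
A minimal sketch of how these features might be toggled explicitly, assuming the properties named above can be set with Pig's SET command from a script or Grunt session (the values shown are purely illustrative):

-- illustrative only: property names come from the paragraph above
set opt.fetch false;                          -- disable direct HDFS access
set pig.auto.local.enabled true;              -- enable auto local mode
set pig.auto.local.input.maxbytes 134217728;  -- run locally when the input is under ~128 MB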


Getting started

We will use the stream.py script options to extract JSON data and retrieve a specific number of tweets; we can run this with a command such as the following:

$ python stream.py -j -n 10000 > tweets.json

The tweets.json file will contain one JSON string on each line representing a tweet.

Remember that the Twitter API credentials need to be made available as environment variables or hardcoded in the script itself.


Running Pig

Pig is a tool that translates statements written in Pig Latin and executes them either on a single machine in standalone mode or on a full Hadoop cluster when in distributed mode. Even in the latter, Pig's role is to translate Pig Latin statements into MapReduce jobs, and therefore it doesn't require the installation of additional services or daemons. It is used as a command-line tool with its associated libraries.

Cloudera CDH ships with Apache Pig version 0.12. Alternatively, the Pig source code and binary distributions can be obtained at https://pig.apache.org/releases.html.

As can be expected, MapReduce mode requires access to a Hadoop cluster and an HDFS installation. MapReduce mode is the default mode executed when running the pig command at the command-line prompt. Scripts can be executed with the following command:

$ pig -f <script>

Parameters can be passed via the command line using -param <param>=<val>, as follows:

$ pig -param input=tweets.txt

Parameters can also be specified in a param file that can be passed to Pig using the -param_file <file> option. Multiple files can be specified. If a parameter is present multiple times in the file, the last value will be used and a warning will be displayed. A parameter file contains one parameter per line. Empty lines and comments (specified by starting a line with #) are allowed. Within a Pig script, parameters are in the form $<parameter>. A default value can be assigned using the %default statement: %default input 'tweets.json'. The %default command will not work within a Grunt session; we'll discuss Grunt in the next section.
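
As an illustration, a hypothetical parameter file and the matching script fragment might look like the following (the file and parameter names are ours, not from the book examples):

# params.txt: one parameter per line; lines starting with # are comments
input=tweets.json
output=top_hashtags

-- script.pig: parameters are referenced as $input and $output
%default input 'tweets.json'
tweets = LOAD '$input' USING JsonLoader('created_at:chararray, id:long, id_str:chararray, text:chararray');
STORE tweets INTO '$output';

The script could then be launched with a command such as $ pig -param_file params.txt -f script.pig.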

In local mode, all files are installed and run using the local host and filesystem. Specify local mode using the -x flag:

$ pig -x local

In both execution modes, Pig programs can be run either in an interactive shell or in batch mode.


Grunt – the Pig interactive shell

Pig can run in an interactive mode using the Grunt shell, which is invoked when we use the pig command at the terminal prompt. In the rest of this chapter, we will assume that examples are executed within a Grunt session. Other than executing Pig Latin statements, Grunt offers a number of utilities and access to shell commands, as shown in the short session after this list:

fs: allows users to manipulate Hadoop filesystem objects and has the same semantics as the Hadoop CLI
sh: executes commands via the operating system shell
exec: launches a Pig script within an interactive Grunt session
kill: kills a MapReduce job
help: prints a list of all available commands
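
For example, a short Grunt session might combine these utilities as follows (the path and script name are illustrative):

grunt> fs -ls /user/cloudera
grunt> sh date
grunt> exec top_hashtags.pig
grunt> help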

Elastic MapReduce

Pig scripts can be executed on EMR by creating a cluster with --applications Name=Pig,Args=--version,<version>, as follows:

$ aws emr create-cluster \
--name "Pig cluster" \
--ami-version <ami version> \
--instance-type <EC2 instance> \
--instance-count <number of nodes> \
--applications Name=Pig,Args=--version,<version> \
--log-uri <S3 bucket> \
--steps Type=PIG,\
Name="Pig script",\
Args=[-f,s3://<script location>,\
-p,input=<input param>,\
-p,output=<output param>]

The preceding command will provision a new EMR cluster and execute s3://<script location>. Notice that the scripts to be executed and the input (-p input) and output (-p output) paths are expected to be located on S3.

As an alternative to creating a new EMR cluster, it is possible to add Pig steps to an already-instantiated EMR cluster using the following command:

$ aws emr add-steps \
--cluster-id <cluster id> \
--steps Type=PIG,\
Name="Other Pig script",\
Args=[-f,s3://<script location>,\
-p,input=<input param>,\
-p,output=<output param>]

In the preceding command, <cluster id> is the ID of the instantiated cluster.

It is also possible to ssh into the master node and run Pig Latin statements within a Grunt session with the following command:

$ aws emr ssh --cluster-id <cluster id> --key-pair-file <key pair>


Fundamentals of Apache Pig

The primary interface to program Apache Pig is Pig Latin, a procedural language that implements ideas of the dataflow paradigm.

Pig Latin programs are generally organized as follows:

A LOAD statement reads data from HDFS
A series of statements aggregates and manipulates data
A STORE statement writes output to the filesystem
Alternatively, a DUMP statement displays the output to the terminal

The following example shows a sequence of statements that outputs the top 10 hashtags, ordered by frequency, extracted from the dataset of tweets:

tweets = LOAD 'tweets.json'
    USING JsonLoader('created_at:chararray,
        id:long,
        id_str:chararray,
        text:chararray');

hashtags = FOREACH tweets {
    GENERATE FLATTEN(
        REGEX_EXTRACT(
            text,
            '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)', 1)
    ) as tag;
}

hashtags_grpd = GROUP hashtags BY tag;

hashtags_count = FOREACH hashtags_grpd {
    GENERATE
        group,
        COUNT(hashtags) as occurrencies;
}

hashtags_count_sorted = ORDER hashtags_count BY occurrencies DESC;

top_10_hashtags = LIMIT hashtags_count_sorted 10;

DUMP top_10_hashtags;

First, we load the tweets.json dataset from HDFS, de-serialize the JSON file, and map it to a four-column schema that contains a tweet's creation time, its ID in numerical and string form, and the text. For each tweet, we extract hashtags from its text using a regular expression. We aggregate on hashtag, count the number of occurrences, and order by frequency. Finally, we limit the ordered records to the top 10 most frequent hashtags.

A series of statements like the previous one is picked up by the Pig compiler, transformed into MapReduce jobs, and executed on a Hadoop cluster. The planner and optimizer will resolve dependencies on input and output relations and parallelize the execution of statements wherever possible.

Statements are the building blocks of processing data with Pig. They take a relation as input and produce another relation as output. In Pig Latin terms, a relation can be defined as a bag of tuples, two data types we will use throughout the remainder of this chapter.

Users experienced with SQL and the relational data model might find Pig Latin's syntax somewhat familiar. While there are indeed similarities in the syntax itself, Pig Latin implements an entirely different computational model. Pig Latin is procedural: it specifies the actual data transforms to be performed, whereas SQL is declarative and describes the nature of the problem but does not specify the actual runtime processing. In terms of organizing data, a relation can be thought of as a table in a relational database, where tuples in a bag correspond to the rows in a table. Relations are unordered and therefore easily parallelizable, and they are less constrained than relational tables. Pig relations can contain tuples with different numbers of fields, and those with the same field count can have fields of different types in corresponding positions.

A key difference between SQL and the dataflow model adopted by Pig Latin lies in how splits in a data pipeline are managed. In the relational world, a declarative language such as SQL implements and executes queries that will generate a single result. The dataflow model sees data transformations as a graph where input and output are nodes connected by an operator. For instance, intermediate steps of a query might require the input to be grouped by a number of keys and result in multiple outputs (GROUP BY). Pig has built-in mechanisms to manage multiple data flows in such a graph by executing operators as soon as inputs are readily available and potentially applying different operators to each flow. For instance, Pig's implementation of the GROUP BY operator uses the parallel feature (http://pig.apache.org/docs/r0.12.0/perf.html#parallel) to allow a user to increase the number of reduce tasks for the MapReduce jobs generated and hence increase concurrency. An additional side effect of this property is that when multiple operators can be executed in parallel in the same program, Pig does so (more details on Pig's multi-query implementation can be found at http://pig.apache.org/docs/r0.12.0/perf.html#multi-query-execution). Another consequence of Pig Latin's approach to computation is that it allows the persistence of data at any point in the pipeline. It allows the developer to select specific operator implementations and execution plans when necessary, effectively overriding the optimizer.

Pig Latin allows and even encourages developers to insert their own code almost anywhere in a pipeline by means of User Defined Functions (UDFs) as well as by utilizing Hadoop streaming. UDFs allow users to specify custom business logic on how data is loaded, how it is stored, and how it is processed, whereas streaming allows users to launch executables at any point in the dataflow.


Programming Pig

Pig Latin comes with a number of built-in functions (the eval, load/store, math, string, bag, and tuple functions) and a number of scalar and complex data types. Additionally, Pig allows function and data-type extension by means of UDFs and dynamic invocation of Java methods.


Pig data types

Pig supports the following scalar data types:

int: a signed 32-bit integer
long: a signed 64-bit integer
float: a 32-bit floating point
double: a 64-bit floating point
chararray: a character array (string) in Unicode UTF-8 format
bytearray: a byte array (blob)
boolean: a boolean
datetime: a datetime
biginteger: a Java BigInteger
bigdecimal: a Java BigDecimal

Pig supports the following complex data types:

map: an associative array enclosed by [], with the key and value separated by # and items separated by ,
tuple: an ordered list of data, where elements can be of any scalar or complex type, enclosed by () with items separated by ,
bag: an unordered collection of tuples enclosed by {} and separated by ,

By default, Pig treats data as untyped. The user can declare the types of data at load time or manually cast it when necessary. If a data type is not declared, but a script implicitly treats a value as a certain type, Pig will assume it is of that type and cast it accordingly. The fields of a bag or tuple can be referred to by the name tuple.field or by the position $<index>. Pig counts from 0 and hence the first element will be denoted as $0.
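
As a small, hypothetical illustration (the relation and field names are ours), types can be declared at load time and fields referenced either by name or by position:

-- declare scalar and complex types in the LOAD schema
users = LOAD 'users.tsv' USING PigStorage('\t')
    AS (name:chararray, age:int, prefs:map[]);

-- $1 is the second field (age); name refers to the first field
adults = FILTER users BY $1 >= 18;
langs  = FOREACH users GENERATE name, prefs#'lang';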


Pig functions

Built-in functions are implemented in Java, and they try to follow standard Java conventions. There are, however, a number of differences to keep in mind, which are as follows:

Function names are case sensitive and uppercase
If the result value is null, empty, or not a number (NaN), Pig returns null
If Pig is unable to process the expression, it returns an exception

A list of all built-in functions can be found at http://pig.apache.org/docs/r0.12.0/func.html.

Load/store

Load/store functions determine how data goes into and comes out of Pig. The PigStorage, TextLoader, and BinStorage functions can be used to read and write UTF-8 delimited text, unstructured text, and binary data respectively. Support for compression is determined by the load/store function. The PigStorage and TextLoader functions support gzip and bzip2 compression for both read (load) and write (store). The BinStorage function does not support compression.
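
For instance, assuming the compression codec is inferred from the path extension (as it is for PigStorage with gzip and bzip2), reading compressed text and writing compressed output could look like the following; the paths are illustrative:

-- input decompressed automatically because of the .gz extension
raw = LOAD 'tweets.txt.gz' USING PigStorage('\t') AS (id:chararray, text:chararray);

-- output compressed with bzip2 because of the .bz2 suffix on the path
STORE raw INTO 'tweets_out.bz2' USING PigStorage('\t');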

As of version 0.12, Pig includes built-in support for loading and storing Avro and JSON data via AvroStorage (load/store), JsonStorage (store), and JsonLoader (load). At the time of writing, JSON support is still somewhat limited. In particular, Pig expects a schema for the data to be provided as an argument to JsonLoader/JsonStorage, or it assumes that .pig_schema (produced by JsonStorage) is present in the directory containing the input data. In practice, this makes it difficult to work with JSON dumps not generated by Pig itself.

As seen in our following example, we can load the JSON dataset with JsonLoader:

tweets = LOAD 'tweets.json' USING JsonLoader(
    'created_at:chararray,
    id:long,
    id_str:chararray,
    text:chararray,
    source:chararray');

We provide a schema so that the first five elements of a JSON object (created_at, id, id_str, text, and source) are mapped. We can look at the schema of tweets by using describe tweets, which returns the following:

tweets: {created_at: chararray, id: long, id_str: chararray, text: chararray, source: chararray}

Eval

Eval functions implement a set of operations to be applied on an expression that returns a bag or map data type. The expression result is evaluated within the function context.

AVG(expression): computes the average of the numeric values in a single-column bag
COUNT(expression): counts all elements with non-null values in the first position in a bag
COUNT_STAR(expression): counts all elements in a bag
IsEmpty(expression): checks whether a bag or map is empty
MAX(expression), MIN(expression), and SUM(expression): return the max, min, or the sum of elements in a bag
TOKENIZE(expression): splits a string and outputs a bag of words
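
A quick sketch combining a few of these functions, assuming the tweets relation with a text field loaded earlier:

words = FOREACH tweets GENERATE FLATTEN(TOKENIZE(text)) AS word;
grpd  = GROUP words ALL;
stats = FOREACH grpd GENERATE COUNT(words) AS non_null_words,
                              COUNT_STAR(words) AS all_words;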

The tuple, bag, and map functions

These functions allow conversion from and to the bag, tuple, and map types. They include the following:

TOTUPLE(expression), TOMAP(expression), and TOBAG(expression): These coerce expression to a tuple, map, or bag
TOP(n, column, relation): This returns the top n tuples from a bag of tuples

The math, string, and datetime functions

Pig exposes a number of functions provided by the java.lang.Math, java.lang.String, java.util.Date, and Joda-Time DateTime classes (found at http://www.joda.org/joda-time/).

Dynamic invokers

Dynamic invokers allow the execution of Java functions without having to wrap them in a UDF. They can be used for any static function that:

accepts no arguments or accepts a combination of string, int, long, double, float, or array with these same types
returns a string, int, long, double, or float value

Only primitives can be used for numbers, and Java boxed classes (such as Integer) cannot be used as arguments. Depending on the return type, a specific kind of invoker must be used: InvokeForString, InvokeForInt, InvokeForLong, InvokeForDouble, or InvokeForFloat. More details regarding dynamic invokers can be found at http://pig.apache.org/docs/r0.12.0/func.html#dynamic-invokers.
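
For example, following the pattern in the Pig documentation, the static method java.net.URLDecoder.decode (two string arguments, string return value) can be invoked directly; the urls relation and its url field are assumed for illustration:

DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
decoded = FOREACH urls GENERATE UrlDecode(url, 'UTF-8') AS decoded_url;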

Macros

As of version 0.9, Pig Latin's preprocessor supports macro expansion. Macros are defined using the DEFINE statement:

DEFINE macro_name(param1, ..., paramN) RETURNS output_bag {
    pig_latin_statements
};

The macro is expanded inline, and its parameters are referenced in the Pig Latin block within {}.


The macro output relation is given in the RETURNS statement (output_bag). RETURNS void is used for a macro with no output relation.

We can define a macro to count the number of rows in a relation, as follows:

DEFINE count_rows(X) RETURNS cnt {
    grpd = group $X all;
    $cnt = foreach grpd generate COUNT($X);
};

We can use it in a Pig script or Grunt session to count the number of tweets:

tweets_count = count_rows(tweets);

DUMP tweets_count;

Macros allow us to make scripts modular by housing code in separate files and importing them where needed. For example, we can save count_rows in a file called count_rows.macro and later on import it with the command import 'count_rows.macro'.

Macros have a number of limitations; in particular, only Pig Latin statements are allowed inside a macro. It is not possible to use REGISTER statements and shell commands, UDFs are not allowed, and parameter substitution inside the macro is not supported.


Working with data

Pig Latin provides a number of relational operators to combine functions and apply transformations on data. Typical operations in a data pipeline consist of filtering relations (FILTER), aggregating inputs based on keys (GROUP), generating transformations based on columns of data (FOREACH), and joining relations (JOIN) based on shared keys.

In the following sections, we will illustrate such operators on a dataset of tweets generated by loading JSON data.

Filtering

The FILTER operator selects tuples from a relation based on an expression, as follows:

relation = FILTER relation BY expression;

We can use this operator to filter tweets whose text matches the hashtag regular expression, as follows:

tweets_with_tag = FILTER tweets BY
    (text
        MATCHES '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)'
    );

Aggregation

The GROUP operator groups together data in one or more relations based on an expression or a key, as follows:

relation = GROUP relation BY expression;

We can group tweets by the source field into a new relation grpd, as follows:

grpd = GROUP tweets BY source;

It is possible to group on multiple dimensions by specifying a tuple as the key, as follows:

grpd = GROUP tweets BY (created_at, source);

The result of a GROUP operation is a relation that includes one tuple per unique value of the group expression. This tuple contains two fields. The first field is named group and is of the same type as the group key. The second field takes the name of the original relation and is of the type bag. The names of both fields are generated by the system.
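
For instance, describing the grpd relation from the example above should return something along these lines (assuming the five-field tweets schema used earlier in this chapter):

grunt> DESCRIBE grpd;
grpd: {group: chararray, tweets: {(created_at: chararray, id: long, id_str: chararray, text: chararray, source: chararray)}}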

Using the ALL keyword, Pig will aggregate across the whole relation. The GROUP tweets ALL statement will aggregate all tuples into the same group.

As previously mentioned, Pig allows explicit handling of the concurrency level of the GROUP operator using the PARALLEL operator:

grpd = GROUP tweets BY (created_at, id) PARALLEL 10;

In the preceding example, the MapReduce job generated by the compiler will run 10 concurrent reduce tasks. Pig has a heuristic to estimate how many reducers to use.


Another way of globally enforcing the number of reduce tasks is to use the set default_parallel <n> command.

Foreach

The FOREACH operator applies functions on columns, as follows:

relation = FOREACH relation GENERATE transformation;

The output of FOREACH depends on the transformation applied.

We can use the operator to project the text of all tweets that contain a hashtag, as follows:

t = FOREACH tweets_with_tag GENERATE text;

We can also apply a function to the projected columns. For instance, we can use the TOKENIZE function to split each tweet into words, as follows:

t = FOREACH tweets_with_tag GENERATE FLATTEN(TOKENIZE(text)) as word;

The FLATTEN modifier further un-nests the bag generated by TOKENIZE into a tuple of words.

Join

The JOIN operator performs an inner join of two or more relations based on common field values. Its syntax is as follows:

relation = JOIN relation1 BY expression1, relation2 BY expression2;

We can use a join operation to detect tweets that contain positive words, as follows:

positive = LOAD 'positive-words.txt' USING PigStorage() as (w:chararray);

Filter out the comments, as follows:

positive_words = FILTER positive BY NOT w MATCHES '^;.*';

positive_words is a bag of tuples, each containing a word. We then tokenize the tweets' text and create a new bag of (id_str, word) tuples as follows:

id_words = FOREACH tweets {
    GENERATE
        id_str,
        FLATTEN(TOKENIZE(text)) as word;
}

We join the two relations on the word field and obtain a relation of all tweets that contain one or more positive words, as follows:

positive_tweets = JOIN positive_words BY w, id_words BY word;

In this statement, we join positive_words and id_words on the condition that id_words.word is a positive word. The positive_tweets relation is a bag in the form of {w: chararray, id_str: chararray, word: chararray} that contains all elements of positive_words and id_words that match the join condition.


We can combine the GROUP and FOREACH operators to calculate the number of positive words per tweet (with at least one positive word). First, we group the relation of positive tweets by the tweet ID, and then we count the number of occurrences of each ID in the relation, as follows:

grpd = GROUP positive_tweets BY id_str;

score = FOREACH grpd GENERATE FLATTEN(group), COUNT(positive_tweets);

The JOIN operator can make use of the PARALLEL feature as well, as follows:

positive_tweets = JOIN positive_words BY w, id_words BY word PARALLEL 10;

The preceding command will execute the join with 10 reducer tasks.

It is possible to specify the operator's behavior with the USING keyword followed by the ID of a specialized join. More details can be found at http://pig.apache.org/docs/r0.12.0/perf.html#specialized-joins.
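
For example, because positive_words is small enough to fit in memory, the join above could be rewritten as a replicated (map-side) join; note that the relation to be replicated must be listed last:

positive_tweets = JOIN id_words BY word, positive_words BY w USING 'replicated';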


Extending Pig (UDFs)

Functions can be a part of almost every operator in Pig. There are two main differences between UDFs and built-in functions. First, UDFs need to be registered using the REGISTER keyword in order to make them available to Pig. Secondly, they need to be qualified when used. Pig UDFs can currently be implemented in Java, Python, Ruby, JavaScript, and Groovy. The most extensive support is provided for Java functions, which allow you to customize all parts of the process, including data load/store, transformation, and aggregation. Additionally, Java functions are also more efficient because they are implemented in the same language as Pig and because additional interfaces are supported, such as the Algebraic and Accumulator interfaces. On the other hand, the Ruby and Python APIs allow more rapid prototyping.

The integration of UDFs with the Pig environment is mainly managed by the following two statements, REGISTER and DEFINE:

REGISTER registers a JAR file so that the UDFs in the file can be used, as follows:

REGISTER 'piggybank.jar'

DEFINE creates an alias to a function or a streaming command, as follows:

DEFINE MyFunction my.package.uri.MyFunction

Version 0.12 of Pig introduced streaming UDFs as a mechanism for writing functions in languages with no JVM implementation.


ContributedUDFsPig’scodebasehostsaUDFrepositorycalledPiggybank.OtherpopularcontributedrepositoriesareTwitter’sElephantBird(foundathttps://github.com/kevinweil/elephant-bird/)andApacheDataFu(foundathttp://datafu.incubator.apache.org/).

PiggybankPiggybankisaplaceforPiguserstosharetheirfunctions.SharedcodeislocatedintheofficialPigSubversionrepositoryfoundathttp://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/TheAPIdocumentationcanbefoundathttp://pig.apache.org/docs/r0.12.0/api/underthecontribsection.PiggybankUDFscanbeobtainedbycheckingoutandcompilingthesourcesfromtheSubversionrepositoryorbyusingtheJARfilethatshipswithbinaryreleasesofPig.InClouderaCDH,piggybank.jarisavailableat/opt/cloudera/parcels/CDH/lib/pig/piggybank.jar.

ElephantBirdElephantBirdisanopensourcelibraryofallthingsHadoopusedinproductionatTwitter.Thislibrarycontainsanumberofserializationtools,custominputandoutputformats,writables,Pigload/storefunctions,andmoremiscellanea.

ElephantBirdshipswithanextremelyflexibleJSONloaderfunction,whichatthetimeofwriting,isthego-toresourceformanipulatingJSONdatainPig.

ApacheDataFuApacheDataFuPigcollectsanumberofanalyticalfunctionsdevelopedandcontributedbyLinkedIn.Theseincludestatisticalandestimationfunctions,bagandsetoperations,sampling,hashing,andlinkanalysis.


Analyzing the Twitter stream

In the following examples, we will use the implementation of JsonLoader provided by Elephant Bird to load and manipulate JSON data. We will use Pig to explore tweet metadata and analyze trends in the dataset. Finally, we will model the interaction between users as a graph and use Apache DataFu to analyze this social network.


Prerequisites

Download the elephant-bird-pig (http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird-pig/4.5/elephant-bird-pig-4.5.jar), elephant-bird-hadoop-compat (http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird-hadoop-compat/4.5/elephant-bird-hadoop-compat-4.5.jar), and elephant-bird-core (http://central.maven.org/maven2/com/twitter/elephantbird/elephant-bird-core/4.5/elephant-bird-core-4.5.jar) JAR files from the Maven central repository and copy them onto HDFS using the following commands:

$ hdfs dfs -put target/elephant-bird-pig-4.5.jar hdfs:///jar/
$ hdfs dfs -put target/elephant-bird-hadoop-compat-4.5.jar hdfs:///jar/
$ hdfs dfs -put elephant-bird-core-4.5.jar hdfs:///jar/


Dataset exploration

Before diving deeper into the dataset, we need to register the Elephant Bird and DataFu dependencies, as follows:

REGISTER /opt/cloudera/parcels/CDH/lib/pig/datafu-1.1.0-cdh5.0.0.jar
REGISTER /opt/cloudera/parcels/CDH/lib/pig/lib/json-simple-1.1.jar
REGISTER hdfs:///jar/elephant-bird-pig-4.5.jar
REGISTER hdfs:///jar/elephant-bird-hadoop-compat-4.5.jar
REGISTER hdfs:///jar/elephant-bird-core-4.5.jar

Then, load the JSON dataset of tweets using com.twitter.elephantbird.pig.load.JsonLoader, as follows:

tweets = LOAD 'tweets.json' USING
    com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

com.twitter.elephantbird.pig.load.JsonLoader decodes each line of the input file to JSON and passes the resulting map of values to Pig as a single-element tuple. This enables access to elements of the JSON object without having to specify a schema upfront. The -nestedLoad argument instructs the class to load nested data structures.


Tweet metadata

In the remainder of the chapter, we will use metadata from the JSON dataset to model the tweet stream. One example of metadata attached to a tweet is the Place object, which contains geographical information about the user's location. Place contains fields that describe its name, ID, country, country code, and more. A full description can be found at https://dev.twitter.com/docs/platform-objects/places.

place = FOREACH tweets GENERATE (chararray)$0#'place' as place;

Entities give information such as structured data from tweets, URLs, hashtags, and mentions, without having to extract them from text. A description of entities can be found at https://dev.twitter.com/docs/entities. The hashtag entity is an array of tags extracted from a tweet. Each entity has the following two attributes:

Text: is the hashtag text
Indices: is the character position from which the hashtag was extracted

The following code uses entities:

hashtags_bag = FOREACH tweets {
    GENERATE
        FLATTEN($0#'entities'#'hashtags') as tag;
}

We then flatten hashtags_bag to extract each hashtag's text:

hashtags = FOREACH hashtags_bag GENERATE tag#'text' as topic;

Entities for user objects contain information that appears in the user profile and description fields. We can extract the tweet author's ID via the user field in the tweet map:

users = FOREACH tweets GENERATE $0#'user'#'id' as id;


Data preparation

The SAMPLE built-in operator selects a set of n tuples with probability p out of the dataset, as follows:

sampled = SAMPLE tweets 0.01;

The preceding command will select approximately 1 percent of the dataset. Given that SAMPLE is probabilistic (http://en.wikipedia.org/wiki/Bernoulli_sampling), there is no guarantee that the sample size will be exact. Moreover, the function samples with replacement, which means that each item might appear more than once.

Apache DataFu implements a number of sampling methods for cases where having an exact sample size and no replacement is desired (SimpleRandomSampling), sampling with replacement (SimpleRandomSampleWithReplacementVote and SimpleRandomSampleWithReplacementElect), when we want to account for sample bias (WeightedRandomSampling), or to sample across multiple relations (SampleByKey).

We can create a sample of exactly 1 percent of the dataset, with each item having the same probability of being selected, using SimpleRandomSample.

Note

The actual guarantee is a sample of size ceil(p*n) with a probability of at least 99 percent.

First, we pass a sampling probability of 0.01 to the UDF constructor:

DEFINE SRS datafu.pig.sampling.SimpleRandomSample('0.01');

and the bag, created with (GROUP tweets ALL), to be sampled:

sampled = FOREACH (GROUP tweets ALL) GENERATE FLATTEN(SRS(tweets));

The SimpleRandomSample UDF selects without replacement, which means that each item will appear only once.

Note

Which sampling method to use depends on the data we are working with, assumptions on how items are distributed, the size of the dataset, and what we practically want to achieve. In general, when we want to explore a dataset to formulate hypotheses, SimpleRandomSample can be a good choice. However, in several analytics applications, it is common to use methods that assume replacement (for example, bootstrapping).

Note that when working with very large datasets, sampling with replacement and sampling without replacement tend to behave similarly. The probability of an item being selected twice out of a population of billions of items will be low.


Top n statistics

One of the first questions we might want to ask is how frequent certain things are. For instance, we might want to create a histogram of the top 10 topics by the number of mentions. Similarly, we might want to find the top 50 countries or the top 10 users. Before looking at the tweets data, we will define a macro so that we can apply the same selection logic to different collections of items:

DEFINE top_n(rel, col, n)
RETURNS top_n_items {
    grpd = GROUP $rel BY $col;
    cnt_items = FOREACH grpd
        GENERATE FLATTEN(group), COUNT($rel) AS cnt;
    cnt_items_sorted = ORDER cnt_items BY cnt DESC;
    $top_n_items = LIMIT cnt_items_sorted $n;
};

The top_n macro takes a relation rel, the column col we want to count, and the number of items to return n as parameters. In the Pig Latin block, we first group rel by the items in col, count the number of occurrences of each item, sort them, and select the most frequent n.

To find the top 10 English hashtags, we filter tweets by language and extract the hashtag text:

tweets_en = FILTER tweets by $0#'lang' == 'en';

hashtags_bag = FOREACH tweets_en {
    GENERATE
        FLATTEN($0#'entities'#'hashtags') AS tag;
}

hashtags = FOREACH hashtags_bag GENERATE tag#'text' AS tag;

And apply the top_n macro:

top_10_hashtags = top_n(hashtags, tag, 10);

In order to better characterize what is trending and make this information more relevant to users, we can drill down into the dataset and look at hashtags per geographic location.

First, we generate a bag of (place, hashtag) tuples, as follows:

hashtags_country_bag = FOREACH tweets {
    GENERATE
        $0#'place' as place,
        FLATTEN($0#'entities'#'hashtags') as tag;
}

And then, we extract the country code and hashtag text, as follows:

hashtags_country = FOREACH hashtags_country_bag {
    GENERATE
        place#'country_code' as co,
        tag#'text' as tag;
}

Then, we count how many times each country code and hashtag appear together, as follows:

hashtags_country_frequency = FOREACH (GROUP hashtags_country BY (co, tag)) {
    GENERATE
        FLATTEN(group),
        COUNT(hashtags_country) as cnt;
}

Finally, we count the top 10 countries per hashtag with the TOP function, as follows:

hashtags_country_regrouped = GROUP hashtags_country_frequency BY cnt;

top_results = FOREACH hashtags_country_regrouped {
    result = TOP(10, 1, hashtags_country_frequency);
    GENERATE FLATTEN(result);
}

TOP's parameters are the number of tuples to return, the column to compare, and the relation containing said column:

top_results = FOREACH D {
    result = TOP(10, 1, C);
    GENERATE FLATTEN(result);
}

The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/topn.pig.


Datetime manipulation

The created_at field in the JSON tweets gives us time-stamped information about when the tweet was posted. Unfortunately, its format is not compatible with Pig's built-in datetime type.

Piggybank comes to the rescue with a number of time manipulation UDFs contained in org.apache.pig.piggybank.evaluation.datetime.convert. One of them is CustomFormatToISO, which converts an arbitrarily formatted timestamp into an ISO 8601 datetime string.

In order to access these UDFs, we first need to register the piggybank.jar file, as follows:

REGISTER /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar

To make our code less verbose, we create an alias for the CustomFormatToISO class's fully qualified Java name:

DEFINE CustomFormatToISO
    org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();

By knowing how to manipulate timestamps, we can calculate statistics at different time intervals. For instance, we can look at how many tweets are created per hour. Pig has a built-in GetHour function that extracts the hour out of a datetime type. To use this, we first convert the timestamp string to ISO 8601 with CustomFormatToISO and then the resulting chararray to datetime using the built-in ToDate function, as follows:

hourly_tweets = FOREACH tweets {
    GENERATE
        GetHour(
            ToDate(
                CustomFormatToISO(
                    $0#'created_at', 'EEE MMMM d HH:mm:ss Z y')
            )
        ) as hour;
}

Now, it is just a matter of grouping hourly_tweets by hour and then generating a count of tweets per group, as follows:

hourly_tweets_count = FOREACH (GROUP hourly_tweets BY hour) {
    GENERATE FLATTEN(group), COUNT(hourly_tweets);
}

SessionsDataFu’sSessionizeclasscanhelpustobettercaptureuseractivityovertime.Asessionrepresentstheactivityofauserwithinagivenperiodoftime.Forinstance,wecanlookateachuser’stweetstreamatintervalsof15minutesandmeasurethesesessionstodeterminebothnetworkvolumesaswellasuseractivity:

DEFINESessionizedatafu.pig.sessions.Sessionize('15m');

users_activity=FOREACHtweets{

Page 262: the-eye.euthe-eye.eu/public/Site-Dumps/index-of/index-of.co.uk/Big-Data... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files, eBooks,

GENERATE

CustomFormatToISO($0#'created_at',

'EEEMMMMdHH:mm:ssZy')ASdt,

(chararray)$0#'user'#'id'asuser_id;

}

users_activity_sessionized=FOREACH

(GROUPusers_activityBYuser_id){

ordered=ORDERusers_activityBYdt;

GENERATEFLATTEN(Sessionize(ordered))

AS(dt,user_id,session_id);

}

users_activity simply records the time dt at which a given user_id posted a status update.

Sessionize takes the session timeout and a bag as input. The first element of the input bag is an ISO 8601 timestamp, and the bag must be sorted by this timestamp. Events that are within 15 minutes of each other will belong to the same session.

It returns the input bag with a new field, session_id, that uniquely identifies a session. With this data, we can calculate the session's length and some other statistics. More examples of Sessionize usage can be found at http://datafu.incubator.apache.org/docs/datafu/guide/sessions.html.
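
A minimal sketch of one such follow-up statistic, counting the number of status updates in each session:

session_counts = FOREACH (GROUP users_activity_sessionized BY session_id)
    GENERATE group AS session_id, COUNT(users_activity_sessionized) AS updates;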


Capturing user interactions

In the remainder of the chapter, we will look at how to capture patterns from user interactions. As a first step in this direction, we will create a dataset suitable to model a social network. This dataset will contain a timestamp, the ID of the tweet, the user who posted the tweet, the user and tweet she's replying to, and the hashtag in the tweet.

Twitter considers as a reply (in_reply_to_status_id_str) any message beginning with the @ character. Such tweets are interpreted as a direct message to that person. Placing an @ character anywhere else in the tweet is interpreted as a mention ('entities'#'user_mentions') and not a reply. The difference is that mentions are immediately broadcast to a person's followers, whereas replies are not. Replies are, however, considered as mentions.

When working with personally identifiable information, it is a good idea to anonymize, if not remove entirely, sensitive data such as IP addresses, names, and user IDs. A commonly used technique involves a hash function that takes as input the data we want to anonymize, concatenated with additional random data called salt. The following code shows an example of such anonymization:

DEFINE SHA datafu.pig.hash.SHA();

from_to_bag = FOREACH tweets {
    dt = $0#'created_at';
    user_id = (chararray)$0#'user'#'id';
    tweet_id = (chararray)$0#'id_str';
    reply_to_tweet = (chararray)$0#'in_reply_to_status_id_str';
    reply_to = (chararray)$0#'in_reply_to_user_id_str';
    place = $0#'place';
    topics = $0#'entities'#'hashtags';
    GENERATE
        CustomFormatToISO(dt, 'EEE MMMM d HH:mm:ss Z y') AS dt,
        SHA((chararray)CONCAT('SALT', user_id)) AS source,
        SHA(((chararray)CONCAT('SALT', tweet_id))) AS tweet_id,
        ((reply_to_tweet IS NULL)
            ? NULL
            : SHA((chararray)CONCAT('SALT', reply_to_tweet)))
            AS reply_to_tweet_id,
        ((reply_to IS NULL)
            ? NULL
            : SHA((chararray)CONCAT('SALT', reply_to)))
            AS destination,
        (chararray)place#'country_code' as country,
        FLATTEN(topics) AS topic;
}

-- extract the hashtag text
from_to = FOREACH from_to_bag {
    GENERATE
        dt,
        tweet_id,
        reply_to_tweet_id,
        source,
        destination,
        country,
        (chararray)topic#'text' AS topic;
}

In this example, we use CONCAT to append a (not so random) salt string to personal data. We then generate a hash of the salted IDs with DataFu's SHA function. The SHA function requires its input parameters to be non-null. We enforce this condition using if-then-else statements. In Pig Latin, this is expressed as <condition is true> ? <true branch> : <false branch>. If the string is null, we return NULL, and if not, we return the salted hash. To make the code more readable, we use aliases for the tweet JSON fields and reference them in the GENERATE block.


Link analysis

We can redefine our approach to determining trending topics to include users' reactions. A first, naïve, approach could be to consider a topic as important if it caused a number of replies larger than a threshold value.

A problem with this approach is that tweets generate relatively few replies, so the volume of the resulting dataset will be low. Hence, it requires a very large amount of data to contain tweets being replied to and produce any result. In practice, we would likely want to combine this metric with other ones (for example, mentions) in order to perform more meaningful analyses.

To satisfy this query, we will create a new dataset that includes the hashtags extracted from both the tweet and the one a user is replying to:

tweet_hashtag = FOREACH from_to GENERATE tweet_id, topic;

from_to_self_joined = JOIN from_to BY reply_to_tweet_id LEFT,
    tweet_hashtag BY tweet_id;

twitter_graph = FOREACH from_to_self_joined {
    GENERATE
        from_to::dt AS dt,
        from_to::tweet_id AS tweet_id,
        from_to::reply_to_tweet_id AS reply_to_tweet_id,
        from_to::source AS source,
        from_to::destination AS destination,
        from_to::topic AS topic,
        from_to::country AS country,
        tweet_hashtag::topic AS topic_replied;
}

Note that Pig does not allow a self join on the same relation; hence, we have to create tweet_hashtag for the right-hand side of the join. Here, we use the :: operator to disambiguate from which relation and column we want to select records.

Once again, we can look for the top 10 topics by number of replies using the top_n macro:

top_10_topics = top_n(twitter_graph, topic_replied, 10);

Counting things will only take us so far. We can compute more descriptive statistics on this dataset with DataFu. Using the Quantile function, we can calculate the median, the 90th, 95th, and 99th percentiles of the number of hashtag reactions, as follows:

DEFINE Quantile datafu.pig.stats.Quantile('0.5','0.90','0.95','0.99');

Since the UDF expects an ordered bag of integer values as input, we first count the frequency of each topic_replied entry, as follows:

topics_with_replies_grpd = GROUP twitter_graph BY topic_replied;

topics_with_replies_cnt = FOREACH topics_with_replies_grpd {
    GENERATE
        COUNT(twitter_graph) as cnt;
}


Then, we apply Quantile on the bag of frequencies, as follows:

quantiles = FOREACH (GROUP topics_with_replies_cnt ALL) {
    sorted = ORDER topics_with_replies_cnt BY cnt;
    GENERATE Quantile(sorted);
}

The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/graph.pig.


Influential users

We will use PageRank, an algorithm developed by Google to rank web pages (http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf), to identify influential users in the Twitter graph we generated in the previous section.

This type of analysis has a number of use cases, such as targeted and contextual advertisement, recommendation systems, spam detection, and obviously measuring the importance of web pages. A similar approach, used by Twitter to implement the Who to Follow feature, is described in the research paper WTF: The Who to Follow service at Twitter, found at http://stanford.edu/~rezab/papers/wtf_overview.pdf.

Informally, PageRank determines the importance of a page based on the importance of other pages linking to it and assigns it a score between 0 and 1. A high PageRank score indicates that a lot of pages point to it. Intuitively, being linked by pages with a high PageRank is a quality endorsement. In terms of the Twitter graph, we assume that users receiving a lot of replies are important or influential within the social network. In Twitter's case, we consider an extended definition of PageRank, where the link between two users is given by a direct reply and labeled by any eventual hashtag present in the message. Heuristically, we want to identify influential users on a given topic.

In DataFu's implementation, each graph is represented as a bag of (source, edges) tuples. The source tuple is an integer ID representing the source node. The edges are a bag of (destination, weight) tuples. destination is an integer ID representing the destination node. weight is a double representing how much the edge should be weighted. The output of the UDF is a bag of (source, rank) pairs, where rank is the PageRank value for the source user in the graph. Notice that we talked about nodes, edges, and graphs as abstract concepts. In Google's case, nodes are web pages, edges are links from one page to the other, and graphs are groups of pages connected directly and indirectly.

In our case, nodes represent users, edges represent in_reply_to_user_id_str mentions, and edges are labeled by hashtags in tweets. The output of PageRank should suggest which users are influential on a given topic, given their interaction patterns.

In this section, we will write a pipeline to:

Represent data as a graph where each node is a user and a hashtag labels the edge
Map IDs and hashtags to integers so that they can be consumed by PageRank
Apply PageRank
Store the results into HDFS in an interoperable format (Avro)

We represent the graph as a bag of tuples in the form (source, destination, topic), where each tuple represents the interaction between nodes. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/pagerank.pig.

We will map users' and hashtags' text to numerical IDs. We use the Java String hashCode() method to perform this conversion step and wrap the logic in an Eval UDF.


Note

The size of an integer is effectively the upper bound for the number of nodes and edges in the graph. For production code, it is recommended that you use a more robust hash function.

The StringToInt class takes a string as input, calls the hashCode() method, and returns the method output to Pig. The UDF code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch6/udf/com/learninghadoop2/pig/udf/StringToInt.java.

package com.learninghadoop2.pig.udf;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class StringToInt extends EvalFunc<Integer> {
    public Integer exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.hashCode();
        } catch (Exception e) {
            throw new IOException("Cannot convert String to Int", e);
        }
    }
}

We extend org.apache.pig.EvalFunc and override the exec method to return str.hashCode() on the function input. The EvalFunc<Integer> class is parameterized with the return type of the UDF (Integer).

Next, we compile the class and archive it into a JAR, as follows:

$ javac -classpath /opt/cloudera/parcels/CDH/lib/pig/pig.jar:$(hadoop classpath) com/learninghadoop2/pig/udf/StringToInt.java
$ jar cvf myudfs-pig.jar com/learninghadoop2/pig/udf/StringToInt.class

We can now register the UDF in Pig and create an alias to StringToInt, as follows:

REGISTER myudfs-pig.jar
DEFINE StringToInt com.learninghadoop2.pig.udf.StringToInt();

We filter out tweets with no destination and no topic, as follows:

tweets_graph_filtered = FILTER twitter_graph BY
    (destination IS NOT NULL) AND
    (topic IS NOT NULL);

Then, we convert the source, destination, and topic to integer IDs:

from_to = FOREACH tweets_graph_filtered {
    GENERATE
        StringToInt(source) AS source_id,
        StringToInt(destination) AS destination_id,
        StringToInt(topic) AS topic_id;
}

Once data is in the appropriate format, we can reuse the implementation of PageRank and the example code (found at https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/java/datafu/pig/linkanalysis/PageRank.java) provided by DataFu, as shown in the following code:

DEFINE PageRank datafu.pig.linkanalysis.PageRank('dangling_nodes', 'true');

We begin by creating a bag of (source_id, destination_id, topic_id) tuples, as follows:

reply_to = GROUP from_to BY (source_id, destination_id, topic_id);

We count the occurrences of each tuple, that is, how many times two people talked about a topic, as follows:

topic_edges = FOREACH reply_to {
    GENERATE FLATTEN(group), ((double) COUNT(from_to.topic_id)) AS w;
}

Remember that topic is the edge of our graph; we begin by creating an association between the source node and the topic edge, as follows:

topic_edges_grouped = GROUP topic_edges BY (topic_id, source_id);

Then we regroup it with the purpose of adding a destination node and the edge weight, as follows:

topic_edges_grouped = FOREACH topic_edges_grouped {
    GENERATE
        group.topic_id AS topic,
        group.source_id AS source,
        topic_edges.(destination_id, w) AS edges;
}

Once we create the Twitter graph, we calculate the PageRank of all users (source_id):

topic_rank = FOREACH (GROUP topic_edges_grouped BY topic) {
    GENERATE
        group AS topic,
        FLATTEN(PageRank(topic_edges_grouped.(source, edges))) AS (source, rank);
}

topic_rank = FOREACH topic_rank GENERATE topic, source, rank;

We store the result in HDFS in Avro format. If the Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR file to our environment before accessing individual fields. Within Pig, for example, on the Cloudera CDH5 VM:

REGISTER /opt/cloudera/parcels/CDH/lib/avro/avro.jar
REGISTER /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar

STORE topic_rank INTO 'replies-pagerank' USING AvroStorage();

Note

In these last two sections, we made a number of implicit assumptions on what a Twitter graph might look like and what the concepts of topic and user interaction mean. Given the constraints that we posed, the resulting social network we analyzed will be relatively small and not necessarily representative of the entire Twitter social network. Extrapolating results from this dataset is discouraged. In practice, there are many other factors that should be taken into account to generate a robust model of social interaction.


Summary

In this chapter, we introduced Apache Pig, a platform for large-scale data analysis on Hadoop. In particular, we covered the following topics:

The goals of Pig as a way of providing a dataflow-like abstraction that does not require hands-on MapReduce development
How Pig's approach to processing data compares to SQL, where Pig is procedural while SQL is declarative
Getting started with Pig, an easy task, as it is a library that generates custom code and doesn't require additional services
An overview of the data types, core functions, and extension mechanisms provided by Pig
Examples of applying Pig to analyze the Twitter dataset in detail, which demonstrated its ability to express complex concepts in a very concise fashion
How libraries such as Piggybank, Elephant Bird, and DataFu provide repositories for numerous useful prewritten Pig functions

In the next chapter, we will revisit the SQL comparison by exploring tools that expose a SQL-like abstraction over data stored in HDFS.


Chapter 7. Hadoop and SQL

MapReduce is a powerful paradigm that enables complex data processing that can reveal valuable insights. As discussed in earlier chapters however, it does require a different mindset and some training and experience on the model of breaking processing analytics into a series of map and reduce steps. There are several products that are built atop Hadoop to provide higher-level or more familiar views of the data held within HDFS, and Pig is a very popular one. This chapter will explore the other most common abstraction implemented atop Hadoop: SQL.

In this chapter, we will cover the following topics:

What the use cases for SQL on Hadoop are and why it is so popular
HiveQL, the SQL dialect introduced by Apache Hive
Using HiveQL to perform SQL-like analysis of the Twitter dataset
How HiveQL can approximate common features of relational databases such as joins and views
How HiveQL allows the incorporation of user-defined functions into its queries
How SQL on Hadoop complements Pig
Other SQL-on-Hadoop products such as Impala and how they differ from Hive


Why SQL on Hadoop

So far we have seen how to write Hadoop programs using the MapReduce APIs and how Pig Latin provides a scripting abstraction and a wrapper for custom business logic by means of UDFs. Pig is a very powerful tool, but its dataflow-based programming model is not familiar to most developers or business analysts. The traditional tool of choice for such people to explore data is SQL.

Back in 2008 Facebook released Hive, the first widely used implementation of SQL on Hadoop.

Instead of providing a way of more quickly developing map and reduce tasks, Hive offers an implementation of HiveQL, a query language based on SQL. Hive takes HiveQL statements and immediately and automatically translates the queries into one or more MapReduce jobs. It then executes the overall MapReduce program and returns the results to the user.

This interface to Hadoop not only reduces the time required to produce results from data analysis, it also significantly widens the net as to who can use Hadoop. Instead of requiring software development skills, anyone who's familiar with SQL can use Hive.

As a consequence of these attributes, HiveQL is often used as a tool for business and data analysts to perform ad hoc queries on the data stored on HDFS. With Hive, the data analyst can work on refining queries without the involvement of a software developer. Just as with Pig, Hive also allows HiveQL to be extended by means of User Defined Functions, enabling the base SQL dialect to be customized with business-specific functionality.


Other SQL-on-Hadoop solutions

Though Hive was the first product to introduce and support HiveQL, it is no longer the only one. Later in this chapter, we will also discuss Impala, released in 2013 and already a very popular tool, particularly for low-latency queries. There are others, but we will mostly discuss Hive and Impala as they have been the most successful.

While introducing the core features and capabilities of SQL on Hadoop however, we will give examples using Hive; even though Hive and Impala share many SQL features, they also have numerous differences. We don't want to constantly have to caveat each new feature with exactly how it is supported in Hive compared to Impala. We'll generally be looking at aspects of the feature set that are common to both, but if you use both products, it's important to read the latest release notes to understand the differences.


Prerequisites

Before diving into specific technologies, let's generate some data that we'll use in the examples throughout this chapter. We'll create a modified version of an earlier Pig script as the main functionality for this. The script in this chapter assumes that the Elephant Bird JARs used previously are available in the /jar directory on HDFS. The full source code is at https://github.com/learninghadoop2/book-examples/blob/master/ch7/extract_for_hive.pig, but the core of extract_for_hive.pig is as follows:

-- load JSON data
tweets = load '$inputDir' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

-- Tweets
tweets_tsv = foreach tweets {
    generate
        (chararray) CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray) $0#'id_str',
        (chararray) $0#'text' as text,
        (chararray) $0#'in_reply_to',
        (boolean) $0#'retweeted' as is_retweeted,
        (chararray) $0#'user'#'id_str' as user_id,
        (chararray) $0#'place'#'id' as place_id;
}

store tweets_tsv into '$outputDir/tweets' using PigStorage('\u0001');

-- Places
needed_fields = foreach tweets {
    generate
        (chararray) CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray) $0#'id_str' as id_str,
        $0#'place' as place;
}

place_fields = foreach needed_fields {
    generate
        (chararray) place#'id' as place_id,
        (chararray) place#'country_code' as co,
        (chararray) place#'country' as country,
        (chararray) place#'name' as place_name,
        (chararray) place#'full_name' as place_full_name,
        (chararray) place#'place_type' as place_type;
}

filtered_places = filter place_fields by co != '';
unique_places = distinct filtered_places;

store unique_places into '$outputDir/places' using PigStorage('\u0001');

-- Users
users = foreach tweets {
    generate
        (chararray) CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray) $0#'id_str' as id_str,
        $0#'user' as user;
}

user_fields = foreach users {
    generate
        (chararray) CustomFormatToISO(user#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt,
        (chararray) user#'id_str' as user_id,
        (chararray) user#'location' as user_location,
        (chararray) user#'name' as user_name,
        (chararray) user#'description' as user_description,
        (int) user#'followers_count' as followers_count,
        (int) user#'friends_count' as friends_count,
        (int) user#'favourites_count' as favourites_count,
        (chararray) user#'screen_name' as screen_name,
        (int) user#'listed_count' as listed_count;
}

unique_users = distinct user_fields;

store unique_users into '$outputDir/users' using PigStorage('\u0001');

Run this script as follows:

$ pig -f extract_for_hive.pig -param inputDir=<json input> -param outputDir=<output path>

The preceding code writes data into three separate TSV files for the tweet, user, and place information. Notice that in the store command, we pass an argument when calling PigStorage. This single argument changes the default field separator from a tab character to the Unicode value U0001 (which can also be typed as Ctrl+A). This is often used as a separator in Hive tables and will be particularly useful to us as our tweet data could contain tabs in other fields.


Overview of Hive

We will now show how you can import data into Hive and run a query against the table abstraction Hive provides over the data. In this example, and in the remainder of the chapter, we will assume that queries are typed into the shell that can be invoked by executing the hive command.

Recently a client called Beeline also became available and will likely be the preferred CLI client in the near future.

When importing any new data into Hive, there is generally a three-stage process:

Create the specification of the table into which the data is to be imported
Import the data into the created table
Execute HiveQL queries against the table

Most of the HiveQL statements are direct analogues to similarly named statements in standard SQL. We assume only a passing knowledge of SQL throughout this chapter, but if you need a refresher, there are numerous good online learning resources.

Hive gives a structured query view of our data, and to enable that, we must first define the specification of the table's columns and import the data into the table before we can execute any queries. A table specification is generated using a CREATE statement that specifies the table name, the name and types of its columns, and some metadata about how the table is stored:

CREATE TABLE tweets (
    created_at string,
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

The statement creates a new table tweets defined by a list of names for columns in the dataset and their data type. We specify that fields are delimited by the Unicode U0001 character and that the format used to store data is TEXTFILE.

Data can be imported from a location in HDFS tweets/ using the LOAD DATA statement:

LOAD DATA INPATH 'tweets' OVERWRITE INTO TABLE tweets;

By default, data for Hive tables is stored on HDFS under /user/hive/warehouse. If a LOAD statement is given a path to data on HDFS, it will not simply copy the data into /user/hive/warehouse, but will move it there instead. If you want to analyze data on HDFS that is used by other applications, then either create a copy or use the EXTERNAL mechanism that will be described later.


Once data has been imported into Hive, we can run queries against it. For instance:

SELECT COUNT(*) FROM tweets;

The preceding code will return the total number of tweets present in the dataset. HiveQL, like SQL, is not case sensitive in terms of keywords, columns, or table names. By convention, SQL statements use uppercase for SQL language keywords, and we will generally follow this when using HiveQL within files, as will be shown later. However, when typing interactive commands, we will frequently take the line of least resistance and use lowercase.

If you look closely at the time taken by the various commands in the preceding example, you'll notice that loading data into a table takes about as long as creating the table specification, but even the simple count of all rows takes significantly longer. The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains the very short execution times.


The nature of Hive tables

Although Hive copies the data file into its working directory, it does not actually process the input data into rows at that point.

Neither the CREATE TABLE nor the LOAD DATA statements truly create concrete table data as such; instead, they produce the metadata that will be used when Hive generates MapReduce jobs to access the data conceptually stored in the table but actually residing on HDFS. Even though the HiveQL statements refer to a specific table structure, it is Hive's responsibility to generate code that correctly maps this to the actual on-disk format in which the data files are stored.

This might seem to suggest that Hive isn't a real database; this is true, it isn't. Whereas a relational database will require a table schema to be defined before data is ingested and will then ingest only data that conforms to that specification, Hive is much more flexible. The less concrete nature of Hive tables means that schemas can be defined based on the data as it has already arrived and not on some assumption of how the data should be, which might prove to be wrong. Though changeable data formats are troublesome regardless of technology, the Hive model provides an additional degree of freedom in handling the problem when, not if, it arises.


Hive architecture

Until version 2, Hadoop was primarily a batch system. As we saw in previous chapters, MapReduce jobs tend to have high latency and overhead derived from submission and scheduling. Internally, Hive compiles HiveQL statements into MapReduce jobs. Hive queries have traditionally been characterized by high latency. This has changed with the Stinger initiative and the improvements introduced in Hive 0.13 that we will discuss later.

Hive runs as a client application that processes HiveQL queries, converts them into MapReduce jobs, and submits these to a Hadoop cluster, either to native MapReduce in Hadoop 1 or to the MapReduce Application Master running on YARN in Hadoop 2.

Regardless of the model, Hive uses a component called the metastore, in which it holds all its metadata about the tables defined in the system. Ironically, this is stored in a relational database dedicated to Hive's usage. In the earliest versions of Hive, all clients communicated directly with the metastore, but this meant that every user of the Hive CLI tool needed to know the metastore username and password.

HiveServer was created to act as a point of entry for remote clients, which could also act as a single access-control point and which controlled all access to the underlying metastore. Because of limitations in HiveServer, the newest way to access Hive is through the multi-client HiveServer2.

Note

HiveServer2 introduces a number of improvements over its predecessor, including user authentication and support for multiple connections from the same client. More information can be found at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2.

Instances of HiveServer and HiveServer2 can be manually executed with the hive --service hiveserver and hive --service hiveserver2 commands, respectively.

In the examples we saw before and in the remainder of this chapter, we implicitly use HiveServer to submit queries via the Hive command-line tool. HiveServer2 comes with Beeline. For compatibility and maturity reasons, Beeline being relatively new, both tools are available on Cloudera and most other major distributions. The Beeline client is part of the core Apache Hive distribution and so is also fully open source. Beeline can be executed in embedded mode with the following command:

$ beeline -u jdbc:hive2://


Data types

HiveQL supports many of the common data types provided by standard database systems. These include primitive types, such as float, double, int, and string, through to structured collection types that provide the SQL analogues to types such as arrays, structs, and unions (structs with options for some fields). Since Hive is implemented in Java, primitive types will behave like their Java counterparts. We can distinguish Hive data types into the following five broad categories:

Numeric: tinyint, smallint, int, bigint, float, double, and decimal
Date and time: timestamp and date
String: string, varchar, and char
Collections: array, map, struct, and uniontype
Misc: boolean, binary, and NULL
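As a brief illustration of the collection types (a minimal sketch; the table name and columns here are ours, not part of the book's Twitter dataset), individual elements are addressed with indexes, keys, and dot notation:

CREATE TABLE type_examples (
    tags array<string>,
    counters map<string, bigint>,
    geo struct<lat: double, lon: double>
);

SELECT tags[0], counters['retweets'], geo.lat FROM type_examples;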


DDL statements

HiveQL provides a number of statements to create, delete, and alter databases, tables, and views. The CREATE DATABASE <name> statement creates a new database with the given name. A database represents a namespace where table and view metadata is contained. If multiple databases are present, the USE <database name> statement specifies which one to use to query tables or create new metadata. If no database is explicitly specified, Hive will run all statements against the default database. SHOW [DATABASES, TABLES, VIEWS] displays the databases currently available within a data warehouse and which table and view metadata is present within the database currently in use:

CREATE DATABASE twitter;
SHOW databases;
USE twitter;
SHOW TABLES;

The CREATE TABLE [IF NOT EXISTS] <name> statement creates a table with the given name. As alluded to earlier, what is really created is the metadata representing the table and its mapping to files on HDFS as well as a directory in which to store the data files. If a table or view with the same name already exists, Hive will raise an exception.

Both table and column names are case insensitive. In older versions of Hive (0.12 and earlier), only alphanumeric and underscore characters were allowed in table and column names. As of Hive 0.13, the system supports unicode characters in column names. Reserved words, such as load and create, need to be escaped by backticks (the ` character) to be treated literally.
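For instance, a hypothetical table (not part of the book's dataset) that uses a reserved word as a column name must quote it with backticks:

CREATE TABLE word_counts (`load` bigint, word string);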

The EXTERNAL keyword specifies that the table exists in resources out of Hive's control, which can be a useful mechanism to extract data from another source at the beginning of a Hadoop-based Extract-Transform-Load (ETL) pipeline. The LOCATION clause specifies where the source file (or directory) is to be found. The EXTERNAL keyword and LOCATION clause have been used in the following code:

CREATE EXTERNAL TABLE tweets (
    created_at string,
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/tweets';

This table will be created in the metastore but the data will not be copied into the /user/hive/warehouse directory.

Tip

Note that Hive has no concept of primary key or unique identifier. Uniqueness and data normalization are aspects to be addressed before loading data into the data warehouse.

The CREATE VIEW <view name> … AS SELECT statement creates a view with the given name. For example, we can create a view to isolate retweets from other messages, as follows:

CREATE VIEW retweets
COMMENT 'Tweets that have been retweeted'
AS SELECT * FROM tweets WHERE retweeted = true;

Unless otherwise specified, column names are derived from the defining SELECT statement. Hive does not currently support materialized views.

The DROP TABLE and DROP VIEW statements remove both metadata and data for a given table or view. When dropping an EXTERNAL table or a view, only metadata will be removed and the actual data files will not be affected.

Hive allows table metadata to be altered via the ALTER TABLE statement, which can be used to change a column type, name, position, and comment or to add and replace columns.

When adding columns, it is important to remember that only metadata will be changed and not the dataset itself. This means that if we were to add a column in the middle of the table which didn't exist in older files, then while selecting from older data, we might get wrong values in the wrong columns. This is because we would be looking at old files with a new format. We will discuss data and schema migrations in Chapter 8, Data Lifecycle Management, when discussing Avro.
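As a minimal sketch of such metadata-only changes (the new column name and comment below are illustrative, not taken from the book), a column can be renamed and another appended to the tweets table as follows:

ALTER TABLE tweets CHANGE COLUMN in_reply_to in_reply_to_user_id string;
ALTER TABLE tweets ADD COLUMNS (lang string COMMENT 'language code of the tweet');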

Similarly, ALTER VIEW <view name> AS <select statement> changes the definition of an existing view.


File formats and storage

The data files underlying a Hive table are no different from any other file on HDFS. Users can directly read the HDFS files in the Hive tables using other tools. They can also use other tools to write to HDFS files that can be loaded into Hive through CREATE EXTERNAL TABLE or through LOAD DATA INPATH.

Hive uses the Serializer and Deserializer classes, SerDe, as well as FileFormat to read and write table rows. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified in a CREATE TABLE statement. The DELIMITED clause instructs the system to read delimited files. Delimiter characters can be escaped using the ESCAPED BY clause.

Hive currently uses the following FileFormat classes to read and write HDFS files:

TextInputFormat and HiveIgnoreKeyTextOutputFormat: will read/write data in plain text file format
SequenceFileInputFormat and SequenceFileOutputFormat: classes read/write data in the Hadoop SequenceFile format

Additionally, the following SerDe classes can be used to serialize and deserialize data:

MetadataTypedColumnsetSerDe: will read/write delimited records such as CSV or tab-separated records
ThriftSerDe and DynamicSerDe: will read/write Thrift objects

JSON

As of version 0.13, Hive ships with the native org.apache.hive.hcatalog.data.JsonSerDe. For older versions of Hive, Hive-JSON-Serde (found at https://github.com/rcongiu/Hive-JSON-Serde) is arguably one of the most feature-rich JSON serialization/deserialization modules.

We can use either module to load JSON tweets without any need for preprocessing and just define a Hive schema that matches the content of a JSON document. In the following example, we use Hive-JSON-Serde.

As with any third-party module, we load the SerDe JARs into Hive with the following code:

ADD JAR json-serde-1.3-jar-with-dependencies.jar;

Then, we issue the usual CREATE statement, as follows:

CREATE EXTERNAL TABLE tweets (
    contributors string,
    coordinates struct<
        coordinates: array<float>,
        type: string>,
    created_at string,
    entities struct<
        hashtags: array<struct<
            indices: array<tinyint>,
            text: string>>,
    ...
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'tweets';

With this SerDe, we can map nested documents (such as entities or users) to the struct or map types. We tell Hive that the data stored at LOCATION 'tweets' is text (STORED AS TEXTFILE) and that each row is a JSON object (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'). In Hive 0.13 and later, we can express this property as ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'.

Manually specifying the schema for complex documents can be a tedious and error-prone process. The hive-json module (found at https://github.com/hortonworks/hive-json) is a handy utility to analyze large documents and generate an appropriate Hive schema. Depending on the document collection, further refinement might be necessary.

In our example, we used a schema generated with hive-json that maps the tweets JSON to a number of struct data types. This allows us to query the data using a handy dot notation. For instance, we can extract the screen name and description fields of a user object with the following code:

SELECT user.screen_name, user.description FROM tweets_json LIMIT 10;

Avro

AvroSerde (https://cwiki.apache.org/confluence/display/Hive/AvroSerDe) allows us to read and write data in Avro format. Starting from 0.14, Avro-backed tables can be created using the STORED AS AVRO statement, and Hive will take care of creating an appropriate Avro schema for the table. Prior versions of Hive are a bit more verbose.

As an example, let's load into Hive the PageRank dataset we generated in Chapter 6, Data Analysis with Apache Pig. This dataset was created using Pig's AvroStorage class, and has the following schema:

{
    "type": "record",
    "name": "record",
    "fields": [
        {"name": "topic", "type": ["null", "int"]},
        {"name": "source", "type": ["null", "int"]},
        {"name": "rank", "type": ["null", "float"]}
    ]
}

The table structure is captured in an Avro record, which contains header information (a name and optional namespace to qualify the name) and an array of the fields. Each field is specified with its name and type as well as an optional documentation string.

For a few of the fields, the type is not a single value, but instead a pair of values, one of which is null. This is an Avro union, and this is the idiomatic way of handling columns that might have a null value. Avro specifies null as a concrete type, and any location where another type might have a null value needs to be specified in this way. This will be handled transparently for us when we use the following schema.

With this definition, we can now create a Hive table that uses this schema for its table specification, as follows:

CREATE EXTERNAL TABLE tweets_pagerank
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='{
    "type": "record",
    "name": "record",
    "fields": [
        {"name": "topic", "type": ["null", "int"]},
        {"name": "source", "type": ["null", "int"]},
        {"name": "rank", "type": ["null", "float"]}
    ]
}')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '${data}/ch5-pagerank';

Then, look at the following table definition from within Hive (note also that HCatalog, which we'll introduce in Chapter 8, Data Lifecycle Management, also supports such definitions):

DESCRIBE tweets_pagerank;
OK
topic   int     from deserializer
source  int     from deserializer
rank    float   from deserializer

In the DDL, we told Hive that data is stored in Avro format using AvroContainerInputFormat and AvroContainerOutputFormat. Each row needs to be serialized and deserialized using org.apache.hadoop.hive.serde2.avro.AvroSerDe. The table schema is inferred by Hive from the Avro schema embedded in avro.schema.literal.

Alternatively, we can store a schema on HDFS and have Hive read it to determine the table structure. Create the preceding schema in a file called pagerank.avsc; this is the standard file extension for Avro schemas. Then place it on HDFS; we prefer to have a common location for schema files such as /schema/avro. Finally, define the table using the avro.schema.url SerDe property WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://<namenode>/schema/avro/pagerank.avsc').

If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before accessing individual fields. Within Hive, on the Cloudera CDH5 VM:

ADD JAR /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar;


We can also use this table like any other. For instance, we can query the data to select the user and topic pairs with a high PageRank:

SELECT source, topic FROM tweets_pagerank WHERE rank >= 0.9;

In Chapter 8, Data Lifecycle Management, we will see how Avro and avro.schema.url play an instrumental role in enabling schema migrations.

Columnar stores

Hive can also take advantage of columnar storage via the ORC (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) and Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) formats.

If a table is defined with very many columns, it is not unusual for any given query to only process a small subset of these columns. But even in a SequenceFile, each full row and all its columns will be read from disk, decompressed, and processed. This consumes a lot of system resources for data that we know in advance is not of interest.

Traditional relational databases also store data on a row basis, and a type of database called columnar changed this to be column-focused. In the simplest model, instead of one file for each table, there would be one file for each column in the table. If a query only needed to access five columns in a table with 100 columns in total, then only the files for those five columns will be read. Both ORC and Parquet use this principle as well as other optimizations to enable much faster queries.
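As a minimal sketch (assuming a Hive release with ORC support, 0.11 or later; the table name orc_tweets is ours, not the book's), an existing table can be copied into a columnar layout with a CREATE TABLE ... AS SELECT statement:

CREATE TABLE orc_tweets
STORED AS ORC
AS SELECT * FROM tweets;

From Hive 0.13 onwards, STORED AS PARQUET can be used in the same way.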


Queries

Tables can be queried using the familiar SELECT … FROM statement. The WHERE statement allows the specification of filtering conditions, GROUP BY aggregates records, ORDER BY specifies sorting criteria, and LIMIT specifies the number of records to retrieve. Aggregate functions, such as count and sum, can be applied to aggregated records. For instance, the following code returns the top 10 most prolific users in the dataset:

SELECT user_id, COUNT(*) AS cnt FROM tweets GROUP BY user_id ORDER BY cnt DESC LIMIT 10

This returns the top 10 most prolific users in the dataset:

2263949659    4
1332188053    4
959468857     3
1367752118    3
362562944     3
58646041      3
2375296688    3
1468188529    3
37114209      3
2385040940    3

We can improve the readability of the hive output by setting the following:

SET hive.cli.print.header=true;

This will instruct hive, though not beeline, to print column names as part of the output.

Tip

You can add the command to the .hiverc file usually found in the root of the executing user's home directory to have it apply to all hive CLI sessions.
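A minimal sketch of such a file (the contents are illustrative) would be:

$ cat ~/.hiverc
SET hive.cli.print.header=true;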

HiveQL implements a JOIN operator that enables us to combine tables together. In the Prerequisites section, we generated separate datasets for the user and place objects. Let's now load them into Hive using external tables.

We first create a user table to store user data, as follows:

CREATE EXTERNAL TABLE user (
    created_at string,
    user_id string,
    `location` string,
    name string,
    description string,
    followers_count bigint,
    friends_count bigint,
    favourites_count bigint,
    screen_name string,
    listed_count bigint
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/users';

We then create a place table to store location data, as follows:

CREATE EXTERNAL TABLE place (
    place_id string,
    country_code string,
    country string,
    `name` string,
    full_name string,
    place_type string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${input}/places';

We can use the JOIN operator to display the names of the 10 most prolific users, as follows:

SELECT tweets.user_id, user.name, COUNT(tweets.user_id) AS cnt
FROM tweets
JOIN user ON user.user_id = tweets.user_id
GROUP BY tweets.user_id, user.user_id, user.name
ORDER BY cnt DESC LIMIT 10;

Tip

Only equality, outer, and left (semi) joins are supported in Hive.

Notice that there might be multiple entries with a given user ID but different values for the followers_count, friends_count, and favourites_count columns. To avoid duplicate entries, we count only user_id from the tweets table.

We can rewrite the previous query as follows:

SELECT tweets.user_id, u.name, COUNT(*) AS cnt
FROM tweets
JOIN (SELECT user_id, name FROM user GROUP BY user_id, name) u
ON u.user_id = tweets.user_id
GROUP BY tweets.user_id, u.name
ORDER BY cnt DESC LIMIT 10;

Instead of directly joining the user table, we execute a subquery, as follows:

SELECT user_id, name FROM user GROUP BY user_id, name;

The subquery extracts unique user IDs and names. Note that Hive has limited support for subqueries, historically only permitting a subquery in the FROM clause of a SELECT statement. Hive 0.13 has added limited support for subqueries within the WHERE clause also.

HiveQL is an ever-evolving rich language, a full exposition of which is beyond the scope of this chapter. A description of its query and DDL capabilities can be found at https://cwiki.apache.org/confluence/display/Hive/LanguageManual.


Structuring Hive tables for given workloads

Often Hive isn't used in isolation; instead, tables are created with particular workloads in mind, or Hive is invoked in ways that are suitable for inclusion in automated processes. We'll now explore some of these scenarios.


Partitioning a table

With columnar file formats, we explained the benefits of excluding unneeded data as early as possible when processing a query. A similar concept has been used in SQL for some time: table partitioning.

When creating a partitioned table, a column is specified as the partition key. All values with that key are then stored together. In Hive's case, different subdirectories for each partition key are created under the table directory in the warehouse location on HDFS.

It's important to understand the cardinality of the partition column. With too few distinct values, the benefits are reduced as the files are still very large. If there are too many values, then queries might need a large number of files to be scanned to access all the required data. Perhaps the most common partition key is one based on date. We could, for example, partition our user table from earlier based on the created_at column, that is, the date the user was first registered. Note that since partitioning a table by definition affects its file structure, we create this table now as a non-external one, as follows:

CREATE TABLE partitioned_user (
    created_at string,
    user_id string,
    `location` string,
    name string,
    description string,
    followers_count bigint,
    friends_count bigint,
    favourites_count bigint,
    screen_name string,
    listed_count bigint
) PARTITIONED BY (created_at_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

To load data into a partition, we can explicitly give a value for the partition into which to insert the data, as follows:

INSERT INTO TABLE partitioned_user
PARTITION (created_at_date = '2014-01-01')
SELECT
    created_at,
    user_id,
    location,
    name,
    description,
    followers_count,
    friends_count,
    favourites_count,
    screen_name,
    listed_count
FROM user;

This is at best verbose, as we need a statement for each partition key value; if a single LOAD or INSERT statement contains data for multiple partitions, it just won't work. Hive also has a feature called dynamic partitioning, which can help us here. We set the following three variables:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=5000;

The first two statements enable all partitions (nonstrict option) to be dynamic. The third one allows 5,000 distinct partitions to be created on each mapper and reducer node.

We can then simply use the name of the column to be used as the partition key, and Hive will insert data into partitions depending on the value of the key for a given row:

INSERT INTO TABLE partitioned_user
PARTITION (created_at_date)
SELECT
    created_at,
    user_id,
    location,
    name,
    description,
    followers_count,
    friends_count,
    favourites_count,
    screen_name,
    listed_count,
    to_date(created_at) AS created_at_date
FROM user;

Even though we use only a single partition column here, we can partition a table by multiple column keys; just have them as a comma-separated list in the PARTITIONED BY clause.

Note that the partition key columns need to be included as the last columns in any statement being used to insert into a partitioned table. In the preceding code we use Hive's to_date function to convert the created_at timestamp to a YYYY-MM-DD formatted string.

Partitioned data is stored in HDFS as /path/to/warehouse/<database>/<table>/key=<value>. In our example, the partitioned_user table structure will look like /user/hive/warehouse/default/partitioned_user/created_at_date=2014-04-01.

If data is added directly to the filesystem, for instance by some third-party processing tool or by hadoop fs -put, the metastore won't automatically detect the new partitions. The user will need to manually run an ALTER TABLE statement such as the following for each newly added partition:

ALTER TABLE <table_name> ADD PARTITION (<partition column> = '<value>');

To add metadata for all partitions not currently present in the metastore, we can use the MSCK REPAIR TABLE <table_name>; statement. On EMR, this is equivalent to executing the following statement:

ALTER TABLE <table_name> RECOVER PARTITIONS;


Notice that both statements will also work with EXTERNAL tables. In the following chapter, we will see how this pattern can be exploited to create flexible and interoperable pipelines.

Overwriting and updating data

Partitioning is also useful when we need to update a portion of a table. Normally a statement of the following form will replace all the data for the destination table:

INSERT OVERWRITE TABLE <table> …

If OVERWRITE is omitted, then each INSERT statement will add additional data to the table. Sometimes, this is desirable, but often, the source data being ingested into a Hive table is intended to fully update a subset of the data and keep the rest untouched.

If we perform an INSERT OVERWRITE statement (or a LOAD OVERWRITE statement) into a partition of a table, then only the specified partition will be affected. Thus, if we were inserting user data and only wanted to affect the partitions with data in the source file, we could achieve this by adding the OVERWRITE keyword to our previous INSERT statement.

We can also add caveats to the SELECT statement. Say, for example, we only wanted to update data for a certain month:

INSERT INTO TABLE partitioned_user
PARTITION (created_at_date)
SELECT
    created_at,
    user_id,
    location,
    name,
    description,
    followers_count,
    friends_count,
    favourites_count,
    screen_name,
    listed_count,
    to_date(created_at) AS created_at_date
FROM user
WHERE to_date(created_at) BETWEEN '2014-03-01' AND '2014-03-31';

Bucketing and sorting

Partitioning a table is a construct that you take explicit advantage of by using the partition column (or columns) in the WHERE clause of queries against the tables. There is another mechanism called bucketing that can further segment how a table is stored and does so in a way that allows Hive itself to optimize its internal query plans to take advantage of the structure.

Let's create bucketed versions of our tweets and user tables; note the additional CLUSTERED BY and SORTED BY clauses in the following CREATE TABLE statements:

CREATE TABLE bucketed_tweets (
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
) PARTITIONED BY (created_at string)
CLUSTERED BY (user_id) INTO 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

CREATE TABLE bucketed_user (
    user_id string,
    `location` string,
    name string,
    description string,
    followers_count bigint,
    friends_count bigint,
    favourites_count bigint,
    screen_name string,
    listed_count bigint
) PARTITIONED BY (created_at string)
CLUSTERED BY (user_id) SORTED BY (name) INTO 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

Note that we changed the tweets table to also be partitioned; you can only bucket a table that is partitioned.

Just as we need to specify a partition column when inserting into a partitioned table, we must also take care to ensure that data inserted into a bucketed table is correctly clustered. We do this by setting the following flag before inserting the data into the table:

SET hive.enforce.bucketing=true;

Just as with partitioned tables, you cannot apply the bucketing function when using the LOAD DATA statement; if you wish to load external data into a bucketed table, first insert it into a temporary table, and then use the INSERT … SELECT … syntax to populate the bucketed table.

When data is inserted into a bucketed table, rows are allocated to a bucket based on the result of a hash function applied to the column specified in the CLUSTERED BY clause.

One of the greatest advantages of bucketing a table comes when we need to join two tables that are similarly bucketed, as in the previous example. So, for example, any query of the following form would be vastly improved:

SET hive.optimize.bucketmapjoin=true;

SELECT …
FROM bucketed_user u JOIN bucketed_tweet t
ON u.user_id = t.user_id;

With the join being performed on the column used to bucket the table, Hive can optimize the amount of processing as it knows that each bucket contains the same set of user_id values in both tables. When determining which rows to match, only those in the corresponding bucket need to be compared, and not the whole table. This does require that the tables are both clustered on the same column and that the bucket numbers are either identical or one is a multiple of the other. In the latter case, with say one table clustered into 32 buckets and another into 64, the nature of the default hash function used to allocate data to a bucket means that the IDs in bucket 3 in the first table will cover those in both buckets 3 and 35 in the second.
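As a worked illustration (assuming the bucket is chosen as the hash value modulo the number of buckets, which is how Hive's default bucketing behaves for integer keys), two keys hashing to 1,000,003 and 1,000,035 both fall into bucket 3 of a 32-bucket table, since both values equal 3 modulo 32; in a 64-bucket table they land in buckets 3 and 35 respectively, which is why one bucket of the smaller table maps onto exactly two buckets of the larger one.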

Sampling data

Bucketing a table can also help while using Hive's ability to sample data in a table. Sampling allows a query to gather only a specified subset of the overall rows in the table. This is useful when you have an extremely large table with moderately consistent data patterns. In such a case, applying a query to a small fraction of the data will be much faster and will still give a broadly representative result. Note, of course, that this only applies to queries where you are looking to determine table characteristics, such as pattern ranges in the data; if you are trying to count anything, then the result needs to be scaled to the full table size.

For a non-bucketed table, you can sample in a mechanism similar to what we saw earlier by specifying that the query should only be applied to a certain subset of the table:

SELECT max(friends_count)
FROM user TABLESAMPLE(BUCKET 2 OUT OF 64 ON name);

In this query, Hive will effectively hash the rows in the table into 64 buckets based on the name column. It will then only use the second bucket for the query. Multiple buckets can be specified, and if RAND() is given as the ON clause, then the entire row is used by the bucketing function.

Though successful, this is highly inefficient as the full table needs to be scanned to generate the required subset of data. If we sample on a bucketed table and ensure the number of buckets sampled is equal to or a multiple of the buckets in the table, then Hive will only read the buckets in question. For example:

SELECT MAX(friends_count)
FROM bucketed_user TABLESAMPLE(BUCKET 2 OUT OF 32 ON user_id);

In the preceding query against the bucketed_user table, which is created with 64 buckets on the user_id column, the sampling, since it is using the same column, will only read the required buckets. In this case, these will be buckets 2 and 34 from each partition.

A final form of sampling is block sampling. In this case, we can specify the required amount of the table to be sampled, and Hive will use an approximation of this by only reading enough source data blocks on HDFS to meet the required size. Currently, the data size can be specified as either a percentage of the table, as an absolute data size, or as a number of rows (in each block). The syntax for TABLESAMPLE is as follows, which will sample 0.5 percent of the table, 1 GB of data, or 100 rows per split, respectively:

TABLESAMPLE(0.5 PERCENT)
TABLESAMPLE(1G)
TABLESAMPLE(100 ROWS)

If these latter forms of sampling are of interest, then consult the documentation, as there are some specific limitations on the input format and file formats that are supported.


Writing scripts

We can place Hive commands in a file and run them with the -f option in the hive CLI utility:

$ cat show_tables.hql
show tables;
$ hive -f show_tables.hql

We can parameterize HiveQL statements by means of the hiveconf mechanism. This allows us to specify an environment variable name at the point it is used rather than at the point of invocation. For example:

$ cat show_tables2.hql
show tables like '${hiveconf:TABLENAME}';
$ hive -hiveconf TABLENAME=user -f show_tables2.hql

The variable can also be set within the Hive script or an interactive session:

SET TABLE_NAME='user';

The preceding hiveconf argument will add any new variables in the same namespace as the Hive configuration options. As of Hive 0.8, there is a similar option called hivevar that adds any user variables into a distinct namespace. Using hivevar, the preceding command would be as follows:

$ cat show_tables3.hql
show tables like '${hivevar:TABLENAME}';
$ hive -hivevar TABLENAME=user -f show_tables3.hql

Or we can write the command interactively:

SET hivevar:TABLE_NAME='user';


Hive and Amazon Web Services

With Elastic MapReduce as the AWS Hadoop-on-demand service, it is of course possible to run Hive on an EMR cluster. But it is also possible to use Amazon storage services, particularly S3, from any Hadoop cluster, be it within EMR or your own local cluster.


Hive and S3

As mentioned in Chapter 2, Storage, it is possible to specify a default filesystem other than HDFS for Hadoop, and S3 is one option. But it doesn't have to be an all-or-nothing thing; it is possible to have specific tables stored in S3. The data for these tables will be retrieved into the cluster to be processed, and any resulting data can either be written to a different S3 location (the same table cannot be the source and destination of a single query) or onto HDFS.

We can take a file of our tweet data and place it onto a location in S3 with a command such as the following:

$ aws s3 cp tweets.tsv s3://<bucket-name>/tweets/

We firstly need to specify the access key and secret access key that can access the bucket. This can be done in three ways:

Set fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey to the appropriate values in the Hive CLI
Set the same values in hive-site.xml, though note this limits use of S3 to a single set of credentials
Specify the table location explicitly in the table URL, that is, s3n://<access key>:<secret access key>@<bucket>/<path>
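A minimal sketch of the first option (the values shown are placeholders, not real credentials) sets the properties for the current session from the Hive CLI:

SET fs.s3n.awsAccessKeyId=<your access key>;
SET fs.s3n.awsSecretAccessKey=<your secret key>;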

Then we can create a table referencing this data, as follows:

CREATE TABLE remote_tweets (
    created_at string,
    tweet_id string,
    text string,
    in_reply_to string,
    retweeted boolean,
    user_id string,
    place_id string
) CLUSTERED BY (user_id) INTO 64 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3n://<bucket-name>/tweets'

This can be an incredibly effective way of pulling S3 data into a local Hadoop cluster for processing.

Note

In order to use AWS credentials in the URI of an S3 location, regardless of how the parameters are passed, the secret and access keys must not contain /, +, =, or \ characters. If necessary, a new set of credentials can be generated from the IAM console at https://console.aws.amazon.com/iam/.

In theory, you can just leave the data in the external table and refer to it when needed to avoid WAN data transfer latencies (and costs), even though it often makes sense to pull the data into a local table and do future processing from there. If the table is partitioned, then you might find yourself retrieving a new partition each day, for example.


Hive on Elastic MapReduce

On one level, using Hive within Amazon Elastic MapReduce is just the same as everything discussed in this chapter. You can create a persistent cluster, log into the master node, and use the Hive CLI to create tables and submit queries. Doing all this will use the local storage on the EC2 instances for the table data.

Not surprisingly, jobs on EMR clusters can also refer to tables whose data is stored on S3 (or DynamoDB). And also not surprisingly, Amazon has made extensions to its version of Hive to make all this very seamless. It is quite simple from within an EMR job to pull data from a table stored in S3, process it, write any intermediate data to the EMR local storage, and then write the output results into S3, DynamoDB, or one of a growing list of other AWS services.

The pattern mentioned earlier, where new data is added to a new partition directory for a table each day, has proved very effective in S3; it is often the storage location of choice for large and incrementally growing datasets. There is a syntax difference when using EMR; instead of the MSCK command mentioned earlier, the command to update a Hive table with new data added to a partition directory is as follows:

ALTER TABLE <table-name> RECOVER PARTITIONS;

Consult the EMR documentation for the latest enhancements at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-additional-features.html. Also, consult the broader EMR documentation. In particular, the integration points with other AWS services is an area of rapid growth.


Extending HiveQL

The HiveQL language can be extended by means of plugins and third-party functions. In Hive, there are three types of functions, characterized by the number of rows they take as input and produce as output:

User Defined Functions (UDFs): simpler functions that act on one row at a time.
User Defined Aggregate Functions (UDAFs): take multiple rows as input and generate a single row as output. These are aggregate functions to be used in conjunction with a GROUP BY statement (similar to COUNT(), AVG(), MIN(), MAX(), and so on).
User Defined Table Functions (UDTFs): take a single row as input and generate multiple output rows, forming a logical table that can be used in join expressions.

Tip

These APIs are provided only in Java. For other languages, it is possible to stream data through a user-defined script using the TRANSFORM, MAP, and REDUCE clauses that act as a frontend to Hadoop's streaming capabilities.
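As a minimal sketch of the streaming approach (using /bin/cat as a trivial pass-through command; any executable available on the cluster nodes could take its place), the TRANSFORM clause pipes selected columns through an external program and reads its tab-separated output back into named columns:

SELECT TRANSFORM (user_id, text)
USING '/bin/cat'
AS (user_id string, text string)
FROM tweets;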

Two APIs are available to write UDFs. A simple API, org.apache.hadoop.hive.ql.exec.UDF, can be used for functions that take and return basic writable types. A richer API, which provides support for data types other than writable, is available in the org.apache.hadoop.hive.ql.udf.generic.GenericUDF package. We'll now illustrate how org.apache.hadoop.hive.ql.exec.UDF can be used to implement a string-to-ID function similar to the one we used in Chapter 6, Data Analysis with Apache Pig, to map hashtags to integers. Building a UDF with this API only requires extending the UDF class and writing an evaluate() method, as follows:

public class StringToInt extends UDF {
    public Integer evaluate(Text input) {
        if (input == null)
            return null;
        String str = input.toString();
        return str.hashCode();
    }
}

The function takes a Text object as input and maps it to an integer value with the hashCode() method. The source code of this function can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch7/udf/com/learninghadoop2/hive/udf/StringToInt.java.

Tip

As noted in Chapter 6, Data Analysis with Apache Pig, a more robust hash function should be used in production.


We compile the class and archive it into a JAR file, as follows:

$ javac -classpath $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* com/learninghadoop2/hive/udf/StringToInt.java
$ jar cvf myudfs-hive.jar com/learninghadoop2/hive/udf/StringToInt.class

Before being able to use it, a UDF must be registered in Hive with the following commands:

ADD JAR myudfs-hive.jar;
CREATE TEMPORARY FUNCTION string_to_int AS 'com.learninghadoop2.hive.udf.StringToInt';

The ADD JAR statement adds a JAR file to the distributed cache. The CREATE TEMPORARY FUNCTION <function> AS <class> statement registers a function in Hive that implements a given Java class. The function will be dropped once the Hive session is closed. As of Hive 0.13, it is possible to create permanent functions whose definition is kept in the metastore using CREATE FUNCTION ….
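As a minimal sketch of the permanent variant (assuming Hive 0.13 or later; the HDFS path for the JAR is illustrative), the definition is stored in the metastore and survives across sessions:

CREATE FUNCTION string_to_int
AS 'com.learninghadoop2.hive.udf.StringToInt'
USING JAR 'hdfs:///user/hive/jars/myudfs-hive.jar';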

Once registered, StringToInt can be used in a query just like any other function. In the following example, we first extract a list of hashtags from the tweet's text by applying regexp_extract. Then, we use string_to_int to map each tag to a numerical ID:

SELECT unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag) AS tag_id FROM
(
    SELECT regexp_extract(text,
        '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)') AS hashtag
    FROM tweets
    GROUP BY regexp_extract(text,
        '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)')
) unique_hashtags GROUP BY unique_hashtags.hashtag,
string_to_int(unique_hashtags.hashtag);

Just as we did in the previous chapter, we can use the preceding query to create a lookup table:

CREATE TABLE lookuptable (tag string, tag_id bigint);

INSERT OVERWRITE TABLE lookuptable
SELECT unique_hashtags.hashtag,
    string_to_int(unique_hashtags.hashtag) AS tag_id
FROM
(
    SELECT regexp_extract(text,
        '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)') AS hashtag
    FROM tweets
    GROUP BY regexp_extract(text,
        '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)')
) unique_hashtags
GROUP BY unique_hashtags.hashtag, string_to_int(unique_hashtags.hashtag);


Programmatic interfaces

In addition to the hive and beeline command-line tools, it is possible to submit HiveQL queries to the system via the JDBC and Thrift programmatic interfaces. Support for ODBC was bundled in older versions of Hive, but as of Hive 0.12, it needs to be built from scratch. More information on this process can be found at https://cwiki.apache.org/confluence/display/Hive/HiveODBC.


JDBC

A Hive client written using JDBC APIs looks exactly the same as a client program written for other database systems (for example MySQL). The following is a sample Hive client program using JDBC APIs. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch7/clients/com/learninghadoop2/hive/client/HiveJdbcClient.java.

public class HiveJdbcClient {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    // connection string
    public static String URL = "jdbc:hive2://localhost:10000";

    // Show all tables in the default database
    public static String QUERY = "show tables";

    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        Connection con = DriverManager.getConnection(URL);
        Statement stmt = con.createStatement();
        ResultSet resultSet = stmt.executeQuery(QUERY);
        while (resultSet.next()) {
            System.out.println(resultSet.getString(1));
        }
    }
}

The URL part is the JDBC URI that describes the connection endpoint. The format for establishing a remote connection is jdbc:hive2://<host>:<port>/<database>. Connections in embedded mode can be established by not specifying a host or port, like jdbc:hive2://.

hive and hive2 are the drivers to be used when connecting to HiveServer and HiveServer2, respectively. QUERY contains the HiveQL query to be executed.

TipHive’sJDBCinterfaceexposesonlythedefaultdatabase.Inordertoaccessotherdatabases,youneedtoreferencethemexplicitlyintheunderlyingqueriesusingthe<database>.<table>notation.

FirstweloadtheHiveServer2JDBCdriverorg.apache.hive.jdbc.HiveDriver.

Tip

Page 306: the-eye.euthe-eye.eu/public/Site-Dumps/index-of/index-of.co.uk/Big-Data... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files, eBooks,

Useorg.apache.hadoop.hive.jdbc.HiveDrivertoconnecttoHiveServer.

Then,likewithanyotherJDBCprogram,weestablishaconnectiontoURLanduseittoinstantiateaStatementclass.WeexecuteQUERY,withnoauthentication,andstoretheoutputdatasetintotheResultSetobject.Finally,wescanresultSetandprintitscontenttothecommandline.

Compileandexecutetheexamplewiththefollowingcommands:

$javacHiveJdbcClient.java

$java-cp$(hadoop

classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*:/opt/cloudera/parcels/C

DH/lib/hive/lib/hive-jdbc.jar:

com.learninghadoop2.hive.client.HiveJdbcClient
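To illustrate the earlier tip about the default database, the following hedged variation on the client issues a query that qualifies the table name explicitly. It assumes a twttr database containing a tweets table exists, which is not necessarily the case on a fresh installation:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class HiveJdbcOtherDb {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000");
        // The connection targets the default database, so a table in another
        // database is referenced with the <database>.<table> notation
        ResultSet rs = con.createStatement()
                .executeQuery("SELECT COUNT(*) FROM twttr.tweets");
        while (rs.next()) {
            System.out.println(rs.getLong(1));
        }
    }
}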

Page 307: the-eye.euthe-eye.eu/public/Site-Dumps/index-of/index-of.co.uk/Big-Data... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files, eBooks,

Thrift

Thrift provides lower-level access to Hive and has a number of advantages over the JDBC implementation of HiveServer. Primarily, it allows multiple connections from the same client, and it allows programming languages other than Java to be used with ease. With HiveServer2, it is a less commonly used option but still worth mentioning for compatibility. A sample Thrift client implemented using the Java API can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch7/clients/com/learninghadoop2/hive/client/HiveThriftClient.java. This client can be used to connect to HiveServer, but due to protocol differences, it won't work with HiveServer2.

In the example, we define a getClient() method that takes as input the host and port of a HiveServer service and returns an instance of org.apache.hadoop.hive.service.ThriftHive.Client.

A client is obtained by first instantiating a socket connection, org.apache.thrift.transport.TSocket, to the HiveServer service, and by specifying a protocol, org.apache.thrift.protocol.TBinaryProtocol, to serialize and transmit data, as follows:

TSocket transport = new TSocket(host, port);
transport.setTimeout(TIMEOUT);
transport.open();

TBinaryProtocol protocol = new TBinaryProtocol(transport);
client = new ThriftHive.Client(protocol);

We call getClient() from the main method and use the client to execute a query against an instance of HiveServer running on localhost on port 11111, as follows:

public static void main(String[] args) throws Exception {
    Client client = getClient("localhost", 11111);
    client.execute("show tables");
    List<String> results = client.fetchAll();
    for (String result : results) {
        System.out.println(result);
    }
}

Make sure that HiveServer is running on port 11111, and if not, start an instance with the following command:

$ sudo hive --service hiveserver -p 11111

Compile and execute the HiveThriftClient.java example with:

$ javac -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/* \
    com/learninghadoop2/hive/client/HiveThriftClient.java
$ java -cp $(hadoop classpath):/opt/cloudera/parcels/CDH/lib/hive/lib/*: \
    com.learninghadoop2.hive.client.HiveThriftClient

Page 308: the-eye.euthe-eye.eu/public/Site-Dumps/index-of/index-of.co.uk/Big-Data... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files, eBooks,

Stinger initiative

Hive has remained very successful and capable since its earliest releases, particularly in its ability to provide SQL-like processing on enormous datasets. But other technologies did not stand still, and Hive acquired a reputation of being relatively slow, particularly in regard to lengthy startup times on large jobs and its inability to give quick responses to conceptually simple queries.

These perceived limitations were less due to Hive itself and more a consequence of how the translation of SQL queries into the MapReduce model has much built-in inefficiency when compared to other ways of implementing a SQL query. Particularly in regard to very large datasets, MapReduce saw lots of I/O (and consequently time) spent writing out the results of one MapReduce job just to have them read by another. As discussed in Chapter 3, Processing – MapReduce and Beyond, this is a major driver in the design of Tez, which can schedule jobs on a Hadoop cluster as a graph of tasks that does not require inefficient writes and reads between them.

The following query will serve to compare the MapReduce framework and Tez:

SELECT a.country, COUNT(b.place_id) FROM place a JOIN tweets b ON (a.place_id = b.place_id) GROUP BY a.country;

The following figure contrasts the execution plan for the preceding query on the MapReduce framework versus Tez:

Hive on MapReduce versus Tez

In plain MapReduce, two jobs are created for the GROUP BY and JOIN clauses. The first job is composed of a set of MapReduce tasks that read data from the disk to carry out grouping. The reducers write intermediate results to the disk so that output can be synchronized. The mappers in the second job read the intermediate results from the disk as well as data from table b. The combined dataset is then passed to the reducer, where shared keys are joined. Were we to execute an ORDER BY statement, this would have resulted in a third job and further MapReduce passes. The same query is executed on Tez as a single job by a single set of Map tasks that read data from the disk; I/O, grouping, and joining are pipelined across reducers.

Alongside these architectural limitations, there were quite a few areas around SQL language support that could also provide better efficiency, and in early 2013, the Stinger initiative was launched with an explicit goal of making Hive over 100 times as fast and with much richer SQL support. Hive 0.13 has all the features of the three phases of Stinger, resulting in a much more complete SQL dialect. Also, Tez is offered as an execution framework in addition to a MapReduce-based implementation atop YARN, which is more efficient than previous implementations on Hadoop 1 MapReduce.

With Tez as the execution engine, Hive is no longer limited to a series of linear MapReduce jobs and can instead build a processing graph where any given step can, for example, stream results to multiple sub-steps.

To take advantage of the Tez framework, there is a new hive variable setting:

set hive.execution.engine=tez;

This setting relies on Tez being installed on the cluster; it is available in source form from http://tez.apache.org or in several distributions, though at the time of writing, not Cloudera.

The alternative value is mr, which uses the classic MapReduce model (atop YARN), so it is possible in a single installation to compare the performance of Hive with and without Tez.
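As a simple illustration of such a comparison, the execution engine can be switched per session from any client, including the JDBC client shown earlier in this chapter. The following hedged sketch assumes a Tez-enabled cluster and the place and tweets tables used above; it is illustrative only and not part of the book's example code:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class EngineComparison {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000");
        String query = "SELECT a.country, COUNT(b.place_id) FROM place a "
                + "JOIN tweets b ON (a.place_id = b.place_id) GROUP BY a.country";
        for (String engine : new String[]{"mr", "tez"}) {
            Statement stmt = con.createStatement();
            // hive.execution.engine is a session-level setting
            stmt.execute("set hive.execution.engine=" + engine);
            long start = System.currentTimeMillis();
            stmt.executeQuery(query);
            System.out.println(engine + " took "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }
}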


Impala

Hive is not the only product providing SQL-on-Hadoop capability. The second most widely used is likely Impala, announced in late 2012 and released in spring 2013. Though originally developed internally within Cloudera, its source code is periodically pushed to an open source Git repository (https://github.com/cloudera/impala).

Impala was created out of the same perception of Hive's weaknesses that led to the Stinger initiative.

Impala also took some inspiration from Google Dremel (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf), which was first openly described by a paper published in 2009. Dremel was built at Google to address the gap between the need for very fast queries on very large datasets and the high latency inherent in the existing MapReduce model underpinning Hive at the time. Dremel was a sophisticated approach to this problem that, rather than building mitigations atop MapReduce such as those implemented by Hive, instead created a new service that accessed the same data stored in HDFS. Dremel also benefited from significant work to optimize the storage format of the data in a way that made it more amenable to very fast analytic queries.


The architecture of Impala

The basic architecture has three main components: the Impala daemons, the statestore, and the clients. Recent versions have added additional components that improve the service, but we'll focus on the high-level architecture.

The Impala daemon (impalad) should be run on each host where a DataNode process is managing HDFS data. Note that impalad does not access the filesystem blocks through the full HDFS FileSystem API; instead, it uses a feature called short-circuit reads to make data access more efficient.

When a client submits a query, it can do so to any of the running impalad processes, and this one will become the coordinator for the execution of that query. The key aspect of Impala's performance is that for each query, it generates custom native code, which is then pushed to and executed by all the impalad processes on the system. This highly optimized code performs the query on the local data, and each impalad then returns its subset of the result set to the coordinator node, which performs the final data consolidation to produce the final result. This type of architecture should be familiar to anyone who has worked with any of the (usually commercial and expensive) Massively Parallel Processing (MPP) data warehouse solutions available today, MPP being the term used for this type of shared scale-out architecture. As the cluster runs, the statestore daemon ensures that each impalad process is aware of all the others and provides a view of the overall cluster health.


Co-existing with Hive

Impala, as a newer product, tends to have a more restricted set of SQL data types and supports a more constrained dialect of SQL than Hive. It is, however, expanding this support with each new release. Refer to the Impala documentation (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Impala/impala.html) to get an overview of the current level of support.

Impala supports the Hive metastore mechanism used by Hive to persistently store the metadata surrounding its table structure and storage. This means that on a cluster with an existing Hive setup, it should be immediately possible to use Impala, as it will access the same metastore and therefore provide access to the same tables available in Hive.

But be warned that the differences in SQL dialect and data types might cause unexpected results when working in a combined Hive and Impala environment. Some queries might work on one but not the other, they might show very different performance characteristics (more on this later), or they might actually give different results. This last point might become apparent when using data types such as float and double that are simply treated differently in the underlying systems (Hive is implemented in Java while Impala is written in C++).

As of version 1.2, Impala supports UDFs written in both C++ and Java, although C++ is strongly recommended as a much faster solution. Keep this in mind if you are looking to share custom functions between Hive and Impala.


A different philosophy

When Impala was first released, its greatest benefit was in how it truly enabled what is often called speed-of-thought analysis. Queries could be returned sufficiently fast that an analyst could explore a thread of analysis in a completely interactive fashion without having to wait for minutes at a time for each query to complete. It's fair to say that most adopters of Impala were at times stunned by its performance, especially when compared to the version of Hive shipping at the time.

The Impala focus has remained mostly on these shorter queries, and this does impose some limitations on the system. Impala tends to be quite memory-heavy as it relies on in-memory processing to achieve much of its performance. If a query requires a dataset to be held in memory rather than being available on the executing node, then that query will simply fail in versions of Impala before 2.0.

Comparing the work on Stinger to Impala, it could be argued that Impala has a much stronger focus on excelling in the shorter (and arguably more common) queries that support interactive data analysis. Many business intelligence tools and services are now certified to run directly on Impala. The Stinger initiative has put less effort into making Hive just as fast in the area where Impala excels but has instead improved Hive (to varying degrees) for all workloads. Impala is still developing at a fast pace and Stinger has put additional momentum into Hive, so it is most likely wise to consider both products and determine which best meets the performance and functionality requirements of your projects and workflows.

It should also be kept in mind that there are competitive commercial pressures shaping the direction of Impala and Hive. Impala was created and is still driven by Cloudera, the most popular vendor of Hadoop distributions. The Stinger initiative, though contributed to by many companies as diverse as Microsoft (yes, really!) and Intel, was led by Hortonworks, probably the second largest vendor of Hadoop distributions. The fact is that if you are using the Cloudera distribution of Hadoop, then some of the core features of Hive might be slower to arrive, whereas Impala will always be up to date. Conversely, if you use another distribution, you might get the latest Hive release, but that might either have an older Impala or, as is currently the case, you might have to download and install it yourself.

A similar situation has arisen with the Parquet and ORC file formats mentioned earlier. Parquet is preferred by Impala and developed by a group of companies led by Cloudera, while ORC is preferred by Hive and is championed by Hortonworks.

Unfortunately, the reality is that Parquet support is often very quick to arrive in the Cloudera distribution but less so in, say, the Hortonworks distribution, where the ORC file format is preferred.

These themes are a little concerning since, although competition in this space is a good thing, and arguably the announcement of Impala helped energize the Hive community, there is a greater risk that your choice of distribution might have a larger impact on the tools and file formats that will be fully supported, unlike in the past. Hopefully, the current situation is just an artifact of where we are in the development cycles of all these new and improved technologies, but do consider your choice of distribution carefully in relation to your SQL-on-Hadoop needs.

Page 315: the-eye.euthe-eye.eu/public/Site-Dumps/index-of/index-of.co.uk/Big-Data... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files, eBooks,

Drill, Tajo, and beyond

You should also consider that SQL on Hadoop no longer only refers to Hive or Impala. Apache Drill (http://drill.apache.org) is a fuller implementation of the Dremel model first described by Google. Although Impala implements the Dremel architecture across HDFS data, Drill looks to provide similar functionality across multiple data sources. It is still in its early stages, but if your needs are broader than what Hive or Impala provides, it might be worth considering.

Tajo (http://tajo.apache.org) is another Apache project that seeks to be a full data warehouse system on Hadoop data. With an architecture similar to that of Impala, it offers a much richer system with components such as multiple optimizers and ETL tools that are commonplace in traditional data warehouses but less frequently bundled in the Hadoop world. It has a much smaller user base but has been used by certain companies very successfully for a significant length of time, and might be worth considering if you need a fuller data warehousing solution.

Other products are also emerging in this space, and it's a good idea to do some research. Hive and Impala are awesome tools, but if you find that they don't meet your needs, then look around; something else might.

Page 316: the-eye.euthe-eye.eu/public/Site-Dumps/index-of/index-of.co.uk/Big-Data... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files, eBooks,

Summary

In its early days, Hadoop was sometimes erroneously seen as the latest supposed relational database killer. Over time, it has become more apparent that the more sensible approach is to view it as a complement to RDBMS technologies and that, in fact, the RDBMS community has developed tools such as SQL that are also valuable in the Hadoop world.

HiveQL is an implementation of SQL on Hadoop and was the primary focus of this chapter. In regard to HiveQL and its implementations, we covered the following topics:

How HiveQL provides a logical model atop data stored in HDFS, in contrast to relational databases where the table structure is enforced in advance
How HiveQL supports many standard SQL data types and commands, including joins and views
The ETL-like features offered by HiveQL, including the ability to import data into tables and optimize the table structure through partitioning and similar mechanisms
How HiveQL offers the ability to extend its core set of operators with user-defined code, and how this contrasts with the Pig UDF mechanism
The recent history of Hive developments, such as the Stinger initiative, that have seen Hive transition to an updated implementation that uses Tez
The broader ecosystem around HiveQL that now includes products such as Impala, Tajo, and Drill, and how each of these focuses on specific areas in which to excel

With Pig and Hive, we've introduced alternative models to process MapReduce data, but so far we've not looked at another question: what approaches and tools are required to actually allow this massive dataset being collected in Hadoop to remain useful and manageable over time? In the next chapter, we'll take a slight step up the abstraction hierarchy and look at how to manage the lifecycle of this enormous data asset.

Page 317: the-eye.euthe-eye.eu/public/Site-Dumps/index-of/index-of.co.uk/Big-Data... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files, eBooks,

Chapter 8. Data Lifecycle Management

Our previous chapters were quite technology focused, describing particular tools or techniques and how they can be used. In this and the next chapter, we are going to take a more top-down approach whereby we will describe a problem space you are likely to encounter and then explore how to address it. In particular, we'll cover the following topics:

What we mean by the term data lifecycle management
Why data lifecycle management is something to think about
The categories of tools that can be used to address the problem
How to use these tools to build the first half of a Twitter sentiment analysis pipeline

Page 318: the-eye.euthe-eye.eu/public/Site-Dumps/index-of/index-of.co.uk/Big-Data... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files, eBooks,

What data lifecycle management is

Data doesn't exist only at a point in time. Particularly for long-running production workflows, you are likely to acquire a significant quantity of data in a Hadoop cluster. Requirements rarely stay static for long, so alongside new logic you might also see the format of that data change or require multiple data sources to be used to provide the dataset processed in your application. We use the term data lifecycle management to describe an approach to handling the collection, storage, and transformation of data that ensures that data is where it needs to be, in the format it needs to be in, in a way that allows data and system evolution over time.

Page 319: the-eye.euthe-eye.eu/public/Site-Dumps/index-of/index-of.co.uk/Big-Data... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files, eBooks,

Importance of data lifecycle management

If you build data processing applications, you are by definition reliant on the data that is processed. Just as we consider the reliability of applications and systems, it becomes necessary to ensure that the data is also production-ready.

Data at some point needs to be ingested into Hadoop. Hadoop is one part of an enterprise and often has multiple points of integration with external systems. If the ingest of data coming from those systems is not reliable, then the impact on the jobs that process that data is often as disruptive as a major system failure. Data ingest becomes a critical component in its own right. And when we say the ingest needs to be reliable, we don't just mean that data is arriving; it also has to be arriving in a format that is usable and through a mechanism that can handle evolution over time.

The problem with many of these issues is that they do not arise in a significant fashion until the flows are large, the system is critical, and the business impact of any problems is non-trivial. Ad hoc approaches that worked for a less critical data flow often will simply not scale, and will be very painful to replace on a live system.

Page 320: the-eye.euthe-eye.eu/public/Site-Dumps/index-of/index-of.co.uk/Big-Data... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files, eBooks,

Tools to help

But don't panic! There are a number of categories of tools that can help with the data lifecycle management problem. We'll give examples of the following three broad categories in this chapter:

Orchestration services: building an ingest pipeline usually has multiple discrete stages, and we will use an orchestration tool to allow these to be described, executed, and managed
Connectors: given the importance of integration with external systems, we will look at how we can use connectors to simplify the abstractions provided by Hadoop storage
File formats: how we store the data impacts how we manage format evolution over time, and several rich storage formats have ways of supporting this

Page 321: the-eye.euthe-eye.eu/public/Site-Dumps/index-of/index-of.co.uk/Big-Data... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files, eBooks,

Building a tweet analysis capability

In earlier chapters, we used various implementations of Twitter data analysis to describe several concepts. We will take this capability to a deeper level and approach it as a major case study.

In this chapter, we will build a data ingest pipeline, constructing a production-ready dataflow that is designed with reliability and future evolution in mind.

We'll build out the pipeline incrementally throughout the chapter. At each stage, we'll highlight what has changed, but we can't include full listings at each stage without trebling the size of the chapter. The source code for this chapter, however, has every iteration in its full glory.

Page 322: the-eye.euthe-eye.eu/public/Site-Dumps/index-of/index-of.co.uk/Big-Data... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files, eBooks,

Getting the tweet data

The first thing we need to do is get the actual tweet data. As in previous examples, we can pass the -j and -n arguments to stream.py to dump JSON tweets to stdout:

$ stream.py -j -n 10000 > tweets.json

Since we have this tool that can create a batch of sample tweets on demand, we could start our ingest pipeline by having this job run on a periodic basis. But how?

Page 323: the-eye.euthe-eye.eu/public/Site-Dumps/index-of/index-of.co.uk/Big-Data... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files, eBooks,

Introducing Oozie

We could, of course, bang rocks together and use something like cron for simple job scheduling, but recall that we want an ingest pipeline that is built with reliability in mind. So, we really want a scheduling tool that we can use to detect failures and otherwise respond to exceptional situations.

The tool we will use here is Oozie (http://oozie.apache.org), a workflow engine and scheduler built with a focus on the Hadoop ecosystem.

Oozie provides a means to define a workflow as a series of nodes with configurable parameters and controlled transition from one node to the next. It is installed as part of the Cloudera QuickStart VM, and the main command-line client is, not surprisingly, called oozie.

Note

We've tested the workflows in this chapter against version 5.0 of the Cloudera QuickStart VM, and at the time of writing, Oozie in the latest version, 5.1, has some issues. There's nothing particularly version-specific in our workflows, however, so they should be compatible with any correctly working Oozie v4 implementation.

Though powerful and flexible, Oozie can take a little getting used to, so we'll give some examples and describe what we are doing along the way.

The most common node in an Oozie workflow is an action. It is within action nodes that the steps of the workflow are actually executed; the other node types handle management of the workflow in terms of decisions, parallelism, and failure detection. Oozie has multiple types of actions that it can perform. One of these is the shell action, which can be used to execute any command on the system, such as native binaries, shell scripts, or any other command-line utility. Let's create a script to generate a file of tweets and copy this to HDFS:

set -e
source twitter.keys
python stream.py -j -n 500 > /tmp/tweets.out
hdfs dfs -put /tmp/tweets.out /tmp/tweets/tweets.out
rm -f /tmp/tweets.out

Note that the first line will cause the entire script to fail should any of the included commands fail. We use an environment file, twitter.keys, to provide the Twitter keys to our script; it is of the following form:

export TWITTER_CONSUMER_KEY=<value>
export TWITTER_CONSUMER_SECRET=<value>
export TWITTER_ACCESS_KEY=<value>
export TWITTER_ACCESS_SECRET=<value>

Oozie uses XML to describe its workflows, usually stored in a file called workflow.xml. Let's walk through the definition for an Oozie workflow that calls a shell command.

Page 324: the-eye.euthe-eye.eu/public/Site-Dumps/index-of/index-of.co.uk/Big-Data... · Table of Contents Learning Hadoop 2 Credits About the Authors About the Reviewers Support files, eBooks,

The schema for an Oozie workflow is called workflow-app, and we can give the workflow a specific name. This is useful when viewing job history in the CLI or Oozie web UI. In the examples in this book, we'll use an increasing version number to allow us to more easily separate the iterations within the source repository. This is how we give the workflow-app a specific name:

<workflow-app xmlns="uri:oozie:workflow:0.4" name="v1">

Oozie workflows are made up of a series of connected nodes, each of which represents a step in the process, and which are represented by XML nodes in the workflow definition. Oozie has a number of nodes that deal with the transition of the workflow from one step to the next. The first of these is the start node, which simply states the name of the first node to be executed as part of the workflow, as follows:

<start to="fs-node"/>

We then have the definition for the named start node. In this case, it is an action node, which is the generic node type for most Oozie nodes that actually perform some processing, as follows:

<action name="fs-node">

Action is a broad category of nodes, and we will typically specialize it with the particular processing for this given node. In this case, we are using the fs node type, which allows us to perform filesystem operations:

<fs>

We want to ensure that the directory on HDFS to which we wish to copy the file of tweet data exists, is empty, and has suitable permissions. We do this by trying to delete the directory if it exists, then creating it, and finally applying the required permissions, as follows:

<delete path="${nameNode}/tmp/tweets"/>
<mkdir path="${nameNode}/tmp/tweets"/>
<chmod path="${nameNode}/tmp/tweets" permissions="777"/>
</fs>

We'll see an alternative way of setting up directories later. After performing the functionality of the node, Oozie needs to know how to proceed with the workflow. In most cases, this will comprise moving to another action node if this node was successful and aborting the workflow otherwise. This is specified by the next elements. The ok element gives the name of the node to which to transition if the execution was successful; the error element names the destination node for failure scenarios. Here's how the ok and error elements are used:

<ok to="shell-node"/>
<error to="fail"/>
</action>

<action name="shell-node">

The second action node is again specialized with its specific processing type; in this case, we have a shell node:

<shell xmlns="uri:oozie:shell-action:0.2">

The shell action then has the Hadoop JobTracker and NameNode locations specified. Note that the actual values are given by variables; we'll explain where they come from later. The JobTracker and NameNode are specified as follows:

<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>

As mentioned in Chapter 3, Processing – MapReduce and Beyond, MapReduce uses multiple queues to provide support for different approaches to resource scheduling. The next element specifies the MapReduce queue to which the workflow should be submitted:

<configuration>
  <property>
    <name>mapred.job.queue.name</name>
    <value>${queueName}</value>
  </property>
</configuration>

Now that the shell node is fully configured, we can specify the command to invoke, again via a variable, as follows:

<exec>${EXEC}</exec>

The various steps of Oozie workflows are executed as MapReduce jobs. This shell action will, therefore, be executed as a specific task instance on a particular TaskTracker. We need to specify which files should be copied to the local working directory on the TaskTracker machine before the action can be performed. In this case, we need to copy the main shell script, the Python tweet generator, and the Twitter config file, as follows:

<file>${workflowRoot}/${EXEC}</file>
<file>${workflowRoot}/twitter.keys</file>
<file>${workflowRoot}/stream.py</file>

After closing the shell element, we again specify what to do depending on whether the action completed successfully or not. Because MapReduce is used for job execution, the majority of node types by definition have built-in retry and recovery logic, though this is not the case for shell nodes:

</shell>
<ok to="end"/>
<error to="fail"/>
</action>

If the workflow fails, let's just kill it in this case. The kill node type does exactly that: it stops the workflow from proceeding to any further steps, usually logging error messages along the way. Here's how the kill node type is used:

<kill name="fail">
  <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>

The end node, on the other hand, simply halts the workflow and logs it as a successful completion within Oozie:

<end name="end"/>
</workflow-app>

The obvious question is what the preceding variables represent and from where they get their concrete values. The preceding variables are examples of the Oozie Expression Language, often referred to as EL.

Alongside the workflow definition file (workflow.xml), which describes the steps in the flow, we also need to create a configuration file that gives the specific values for a given execution of the workflow. This separation of functionality and configuration allows us to write workflows that can be used on different clusters, on different file locations, or with different variable values without having to recreate the workflow itself. By convention, this file is usually named job.properties. For the preceding workflow, here's a sample job.properties file.

Firstly, we specify the location of the JobTracker, the NameNode, and the MapReduce queue to which to submit the workflow. The following should work on the Cloudera 5.0 QuickStart VM, though in v5.1 the hostname has been changed to quickstart.cloudera. The important thing is that the specified NameNode and JobTracker addresses need to be in the Oozie whitelist; the local services on the VM are added automatically:

jobTracker=localhost.localdomain:8032
nameNode=hdfs://localhost.localdomain:8020
queueName=default

Next, we set some values for where the workflow definitions and associated files can be found on the HDFS filesystem. Note the use of a variable representing the username running the job. This allows a single workflow to be applied to different paths depending on the submitting user, as follows:

tasksRoot=book
workflowRoot=${nameNode}/user/${user.name}/${tasksRoot}/v1
oozie.wf.application.path=${nameNode}/user/${user.name}/${tasksRoot}/v1

Next, we name the command to be executed in the workflow as ${EXEC}:

EXEC=gettweets.sh

More complex workflows will require additional entries in the job.properties file; the preceding workflow is as simple as it gets.

The oozie command-line tool needs to know where the Oozie server is running. This can be added as an argument to every Oozie shell command, but that gets unwieldy very quickly. Instead, you can set the shell environment variable, as follows:

$ export OOZIE_URL='http://localhost:11000/oozie'

After all that work, we can now actually run an Oozie workflow. Create a directory on HDFS as specified in the values in the job.properties file. With the preceding configuration, we'd be creating this as book/v1 under our home directory on HDFS. Copy the stream.py, gettweets.sh, and twitter.keys files to that directory; these are the files required to perform the actual execution of the shell command. Then, add the workflow.xml file to the same directory.

To run the workflow, then, we do the following:

$ oozie job -run -config <path-to-job.properties>

If submitted successfully, Oozie will print the job name to the screen. You can see the current status of this workflow with:

$ oozie job -info <job-id>

You can also check the logs for the job:

$ oozie job -log <job-id>

In addition, all current and recent jobs can be viewed with:

$ oozie jobs
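The same operations are also available programmatically through Oozie's Java client API, which can be handy when another application needs to trigger or monitor the ingest. The following is an illustrative sketch only, assuming the Oozie client JAR is on the classpath and using the same property values as the job.properties file above (with a cloudera user assumed for the HDFS paths):

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        // Point the client at the Oozie server, as with the OOZIE_URL variable
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        // Build the equivalent of the job.properties file programmatically
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH,
                "hdfs://localhost.localdomain:8020/user/cloudera/book/v1");
        conf.setProperty("jobTracker", "localhost.localdomain:8032");
        conf.setProperty("nameNode", "hdfs://localhost.localdomain:8020");
        conf.setProperty("queueName", "default");
        conf.setProperty("workflowRoot",
                "hdfs://localhost.localdomain:8020/user/cloudera/book/v1");
        conf.setProperty("EXEC", "gettweets.sh");

        // Submit and start the workflow, then poll until it finishes
        String jobId = oozie.run(conf);
        System.out.println("Submitted workflow " + jobId);
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10 * 1000);
        }
        System.out.println("Final status: " + oozie.getJobInfo(jobId).getStatus());
    }
}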

A note on HDFS file permissions

There is a subtle aspect in the shell command that can catch the unwary. As an alternative to having the fs node, we could instead include a prepare element within the shell node to create the directory we need on the filesystem. It would look like the following:

<prepare>
  <mkdir path="${nameNode}/tmp/tweets"/>
</prepare>

The prepare stage is executed by the user who submitted the workflow, but since the actual script execution is performed on YARN, it is usually executed as the yarn user. You might hit a problem where the script generates the tweets and the /tmp/tweets directory is created on HDFS, but the script then fails to have permission to write to that directory. You can resolve this either by assigning permissions more precisely or, as shown earlier, by adding a filesystem node to encapsulate the needed operations. We'll use a mixture of both techniques in this chapter; for non-shell nodes, we'll use prepare elements, particularly if the needed directory is manipulated only by that node. For cases where a shell node is involved or where the created directories will be used across multiple nodes, we'll be safe and use the more explicit fs node.

Making development a little easier

It can sometimes get awkward to manage the files and resources for an Oozie job during development. Some need to be on HDFS, while some need to be local, and changes to some files require changes to others. The easiest approach is often to develop or make changes in a complete clone of the workflow directory on the local filesystem and push changes from there to the similarly named directory in HDFS, not forgetting, of course, to ensure that all changes are under revision control! For operational execution of the workflow, the job.properties file is the only thing that needs to be on the local filesystem and, conversely, all the other files need to be on HDFS. Always remember this: it's all too easy to make changes to a local copy of a workflow, forget to push the changes to HDFS, and then be confused as to why the workflow isn't reflecting the changes.

Extracting data and ingesting into Hive

With our data on HDFS, we can now extract the separate datasets for tweets, users, and place data, as in previous chapters. We can reuse extract_for_hive.pig to parse the raw tweet JSON into separate files, store them again on HDFS, and then follow up with a Hive step that ingests these new files into Hive tables for tweets, users, and places.

To do this within Oozie, we'll need to add two new nodes to our workflow: a Pig action for the first step and a Hive action for the second.

For our Hive action, we'll just create three external tables that point to the files generated by Pig. This would then allow us to follow our previously described model of ingesting into temporary or external tables and using HiveQL INSERT statements from there to insert into the operational, and often partitioned, tables. This create.hql script can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch8/v2/hive/create.hql but is simply of the following form:

CREATE DATABASE IF NOT EXISTS twttr;
USE twttr;

DROP TABLE IF EXISTS tweets;
CREATE EXTERNAL TABLE tweets (
...
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${ingestDir}/tweets';

DROP TABLE IF EXISTS user;
CREATE EXTERNAL TABLE user (
...
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${ingestDir}/users';

DROP TABLE IF EXISTS place;
CREATE EXTERNAL TABLE place (
...
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE
LOCATION '${ingestDir}/places';

Note that the file separator on each table is also explicitly set to match what we are outputting from Pig. In addition to this, locations in both scripts are specified by variables for which we will provide concrete values in our job.properties file.

With the preceding statements, we can create the Pig node for our workflow, found in the source code as v2 of the pipeline. Much of the node definition looks similar to the shell node used previously, as we set the same configuration elements; also notice our use of the prepare element to create the needed output directory. We can create the Pig node for our workflow as shown in the following action:

<action name="pig-node">
  <pig>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <prepare>
      <delete path="${nameNode}/${outputDir}"/>
      <mkdir path="${nameNode}/${outputDir}"/>
    </prepare>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
    </configuration>

As with the shell command, we need to tell the Pig action the location of the actual Pig script. This is specified in the following script element:

<script>${workflowRoot}/pig/extract_for_hive.pig</script>

We also need to modify the command line used to invoke the Pig script to add several parameters. The following elements do this; note the construction pattern wherein one element adds the actual parameter name and the next its value (we'll see an alternative mechanism for passing arguments in the next section):

<argument>-param</argument>
<argument>inputDir=${inputDir}</argument>
<argument>-param</argument>
<argument>outputDir=${outputDir}</argument>
</pig>

Because we want to move from this step to the Hive node, we need to set the following elements appropriately:

<ok to="hive-node"/>
<error to="fail"/>
</action>

The Hive action itself is a little different from the previous nodes; even though it starts in a similar fashion, it specifies the Hive action-specific namespace, as follows:

<action name="hive-node">
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>

The Hive action needs many of the configuration elements used by Hive itself and, in most cases, we copy the hive-site.xml file into the workflow directory and specify its location, as shown in the following XML; note that this mechanism is not Hive-specific and can also be used for custom actions:

<job-xml>${workflowRoot}/hive-site.xml</job-xml>

In addition, we might need to override some MapReduce default configuration properties, as shown in the following XML, where we specify that intermediate compression should be used for our job:

<configuration>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
</configuration>

After configuring the Hive environment, we now specify the location of the Hive script:

<script>${workflowRoot}/hive/create.hql</script>

We also have to provide the mechanism to pass arguments to the Hive script. But instead of building out the command line one component at a time, we'll add the param elements that map the name of a configuration element in the job.properties file to variables specified in the Hive script; this mechanism is also supported with Pig actions:

<param>dbName=${dbName}</param>
<param>ingestDir=${ingestDir}</param>
</hive>

The Hive node then closes as the others do, as follows:

<ok to="end"/>
<error to="fail"/>
</action>

We now need to put all this together to run the multistage workflow in Oozie. The full workflow.xml file can be found at https://github.com/learninghadoop2/book-examples/tree/master/ch8/v2, and the workflow is visualized in the following diagram:

Data ingestion workflow v2


This workflow performs all the steps discussed before; it generates tweet data, extracts subsets of the data via Pig, and then ingests these into Hive.

A note on workflow directory structure

We now have quite a few files in our workflow directory, and it is best to adopt some structure and naming conventions. For the current workflow, our directory on HDFS looks like the following:

/hive/
/hive/create.hql
/lib/
/pig/
/pig/extract_for_hive.pig
/scripts/
/scripts/gettweets.sh
/scripts/stream-json-batch.py
/scripts/twitter-keys
/hive-site.xml
/job.properties
/workflow.xml

The model we follow is to keep configuration files in the top-level directory but to keep files related to a given action type in dedicated subdirectories. Note that it is useful to have a lib directory even if it is empty, as some node types look for it.

With the preceding structure, the job.properties file for our combined job is now the following:

jobTracker=localhost.localdomain:8032
nameNode=hdfs://localhost.localdomain:8020
queueName=default
tasksRoot=book
workflowRoot=${nameNode}/user/${user.name}/${tasksRoot}/v2
oozie.wf.application.path=${nameNode}/user/${user.name}/${tasksRoot}/v2
oozie.use.system.libpath=true
EXEC=gettweets.sh
inputDir=/tmp/tweets
outputDir=/tmp/tweetdata
ingestDir=/tmp/tweetdata
dbName=twttr

In the preceding code, we've fully updated the workflow.xml definition to include all the steps described so far, including an initial fs node to create the required directory without worrying about user permissions.

Introducing HCatalog

If we look at our current workflow, there is inefficiency in how we use HDFS as the interface between Pig and Hive. We need to output the result of our Pig script onto HDFS, where the Hive script can then use it as the location of some new tables. What this highlights is that it is often very useful to have data stored in Hive, but this is limited, as few tools (primarily Hive) can access the Hive metastore and hence read and write such data. If we think about it, Hive has two main layers: its tools for accessing and manipulating its data, plus the execution framework to run queries on that data.

The HCatalog subproject of Hive effectively provides an independent implementation of the first of these layers: the means to access and manipulate data in the Hive metastore. HCatalog provides mechanisms for other tools, such as Pig and MapReduce, to natively read and write table-structured data that is stored on HDFS.

Remember, of course, that the data is stored on HDFS in one format or another. The Hive metastore provides the models to abstract these files into the relational table structure familiar from Hive. So when we say we are storing data in HCatalog, what we really mean is that we are storing data on HDFS in such a way that this data can then be exposed by table structures specified within the Hive metastore. Conversely, when we refer to Hive data, what we really mean is data whose metadata is stored in the Hive metastore, and which can be accessed by any metastore-aware tool, such as HCatalog.

Using HCatalog

The HCatalog command-line tool is called hcat and will be preinstalled on the Cloudera QuickStart VM; it is installed, in fact, with any version of Hive from 0.11 onward.

The hcat utility doesn't have an interactive mode, so generally you will use it with explicit command-line arguments or by pointing it at a file of commands, as follows:

$ hcat -e "use default; show tables"
$ hcat -f commands.hql

Though the hcat tool is useful and can be incorporated into scripts, the more interesting element of HCatalog for our purposes here is its integration with Pig. HCatalog defines a new Pig loader called HCatLoader and a storer called HCatStorer. As the names suggest, these allow Pig scripts to read from or write to Hive tables directly. We can use this mechanism to replace our previous Pig and Hive actions in our Oozie workflow with a single HCatalog-based Pig action that writes the output of the Pig job directly into our tables in Hive.

For clarity, we'll create new tables named tweets_hcat, places_hcat, and users_hcat into which we'll insert this data; note that these are no longer external tables:

CREATE TABLE tweets_hcat …
CREATE TABLE places_hcat …
CREATE TABLE users_hcat …

Note that if we had these commands in a script file, we could use the hcat CLI tool to execute them, as follows:

$ hcat -f create.hql

The hcat CLI tool does not, however, offer an interactive shell akin to the Hive CLI. We can now use our previous Pig script and need only change the store commands, replacing the use of PigStorage with HCatStorer. Our updated Pig script, extract_to_hcat.pig, therefore includes store commands such as the following:

store tweets_tsv into 'twttr.tweets_hcat' using org.apache.hive.hcatalog.pig.HCatStorer();

Note that the package name for the HCatStorer class has the org.apache.hive.hcatalog prefix; when HCatalog was in the Apache incubator, it used org.apache.hcatalog for its package prefix. This older form is now deprecated, and the new form that explicitly shows HCatalog as a subproject of Hive should be used instead.

With this new Pig script, we can now replace our previous Pig and Hive actions with an updated Pig action using HCatalog. This also requires the first usage of the Oozie sharelib, which we'll discuss in the next section. In our workflow definition, the pig element of this action will be defined as shown in the following XML and can be found as v3 of the pipeline in the source bundle; in v3, we've also added a utility Hive node to run before the Pig node to ensure that all necessary tables exist before the Pig script that requires them is executed.

<pig>
  <job-tracker>${jobTracker}</job-tracker>
  <name-node>${nameNode}</name-node>
  <job-xml>${workflowRoot}/hive-site.xml</job-xml>
  <configuration>
    <property>
      <name>mapred.job.queue.name</name>
      <value>${queueName}</value>
    </property>
    <property>
      <name>oozie.action.sharelib.for.pig</name>
      <value>pig,hcatalog</value>
    </property>
  </configuration>
  <script>${workflowRoot}/pig/extract_to_hcat.pig</script>
  <argument>-param</argument>
  <argument>inputDir=${inputDir}</argument>
</pig>

The two changes of note are the addition of the explicit reference to the hive-site.xml file, which is required by HCatalog, and the new configuration element that tells Oozie to include the required HCatalog JARs.

The Oozie sharelib

That last addition touched on an important aspect of Oozie we've not mentioned thus far: the Oozie sharelib. When Oozie runs all its various action types, it requires multiple JARs to access Hadoop and to invoke various tools, such as Hive and Pig. As part of the Oozie installation, a large number of dependent JARs have been placed on HDFS to be used by Oozie and its various action types: this is the Oozie sharelib.

For most usages of Oozie, it's enough to know that the sharelib exists, usually under /user/oozie/share/lib on HDFS, and when, as in the previous example, some explicit configuration values need to be added. When using a Pig action, the Pig JARs will automatically get picked up, but when the Pig script uses something like HCatalog, then this dependency will not be explicitly known to Oozie.

The Oozie CLI allows manipulation of the sharelib, though the scenarios where this will be required are outside the scope of this book. The following command can be useful, though, to see which components are included in the Oozie sharelib:

$ oozie admin -shareliblist

The following command is useful to see the individual JARs comprising a particular component within the sharelib, in this case HCatalog:

$ oozie admin -shareliblist hcat

These commands can be useful to verify that the required JARs are being included and to see which specific versions are being used.

HCatalog and partitioned tables

If you rerun the previous workflow a second time, it will fail; dig into the logs, and you will see HCatalog complaining that it cannot write to a table that already contains data. This is a current limitation of HCatalog; it views tables and partitions within tables as immutable by default. Hive, on the other hand, will add new data to a table or partition; its default view of a table is that it is mutable.

Upcoming changes to Hive and HCatalog will see the support of a new table property that will control this behavior in either tool; for example, the following added to a table definition would allow table appends as supported in Hive today:

TBLPROPERTIES("immutable"="false")

This is currently not available in the shipping version of Hive and HCatalog, however. For us to have a workflow that adds more and more data into our tables, we therefore need to create a new partition for each new run of the workflow. We've made these changes in v4 of our pipeline, where we first recreate the tables with an integer partition key, as follows:

CREATE TABLE tweets_hcat (
…)
PARTITIONED BY (partition_key int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS SEQUENCEFILE;

CREATE TABLE `places_hcat` (
…)
PARTITIONED BY (partition_key int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS SEQUENCEFILE
TBLPROPERTIES("immutable"="false");

CREATE TABLE `users_hcat` (
…)
PARTITIONED BY (partition_key int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS SEQUENCEFILE
TBLPROPERTIES("immutable"="false");

The Pig HCatStorer takes an optional partition definition, and we modify the store statements in our Pig script accordingly; for example:

store tweets_tsv into 'twttr.tweets_hcat'
    using org.apache.hive.hcatalog.pig.HCatStorer('partition_key=$partitionKey');

We then modify our Pig action in the workflow.xml file to include this additional parameter:

<script>${workflowRoot}/pig/extract_to_hcat.pig</script>
<param>inputDir=${inputDir}</param>
<param>partitionKey=${partitionKey}</param>

The question is then how we pass this partition key to the workflow. We could specify it in the job.properties file, but by doing so we would hit the same problem of trying to write to an existing partition on the next re-run.

Ingestion workflow v4

For now, we'll pass this as an explicit argument to the invocation of the Oozie CLI and explore better ways to do this later:

$ oozie job -run -config v4/job.properties -DpartitionKey=12345
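One straightforward way to avoid colliding with an existing partition on each invocation is to derive the key from the current date and hour rather than hardcoding it. The following is only an illustrative sketch, not the book's approach (better options are explored later); runs within the same hour would still collide, which may or may not be acceptable:

import java.text.SimpleDateFormat;
import java.util.Date;

public class PartitionKeyGenerator {
    public static void main(String[] args) {
        // Encode year, month, day, and hour into an integer key,
        // for example 2014093015; this still fits in the int partition_key column
        String key = new SimpleDateFormat("yyyyMMddHH").format(new Date());
        System.out.println("-DpartitionKey=" + key);
    }
}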

Note

Note that a consequence of this behavior is that rerunning an HCat workflow with the same arguments will fail. Be aware of this when testing workflows or playing with the sample code from this book.


Producing derived data

Now that we have our main data pipeline established, there is most likely a series of actions that we wish to take after we add each new additional dataset. As a simple example, note that with our previous mechanism of adding each set of user data to a separate partition, the users_hcat table will contain users multiple times. Let's create a new table for unique users and regenerate this each time we add new user data.

Note that given the aforementioned limitations of HCatalog, we'll use a Hive action for this purpose, as we need to replace the data in a table.

First, we'll create a new table for unique user information, as follows:

CREATE TABLE IF NOT EXISTS `unique_users` (
    `user_id` string,
    `name` string,
    `description` string,
    `screen_name` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS sequencefile;

In this table, we'll only store the attributes of a user that either never change (the ID) or change rarely (the screen name, and so on). We can then write a simple Hive statement to populate this table from the full users_hcat table:

USE twttr;

INSERT OVERWRITE TABLE unique_users
SELECT DISTINCT user_id, name, description, screen_name
FROM users_hcat;

We can then add an additional Hive action node that comes after our previous Pig node in the workflow. When doing this, we discover that our pattern of simply giving nodes names such as hive-node is a really bad idea, as we now have two Hive-based nodes. In v5 of the workflow, we add this new node and also change our nodes to have more descriptive names:

Ingestion workflow v5

Performing multiple actions in parallel

Our workflow has two types of activity: initial setup with the nodes that initialize the filesystem and Hive tables, and the functional nodes that perform actual processing. If we look at the two setup nodes we have been using, it is obvious that they are quite distinct and not interdependent. We can therefore take advantage of an Oozie feature called fork and join nodes to execute these actions in parallel. The start of our workflow.xml file now becomes:

<start to="setup-fork-node"/>

The Oozie fork node contains a number of path elements, each of which specifies a starting node. Each of these will be launched in parallel:

<fork name="setup-fork-node">
  <path start="setup-filesystem-node"/>
  <path start="create-tables-node"/>
</fork>

Each of the specified action nodes is no different from any we have used previously. An action node can link to a series of other nodes; the only requirement is that each parallel series of actions must end with a transition to the join node associated with the fork node, as follows:

<action name="setup-filesystem-node">
  <ok to="setup-join-node"/>
  <error to="fail"/>
</action>

<action name="create-tables-node">
  <ok to="setup-join-node"/>
  <error to="fail"/>
</action>

The join node itself acts as the point of coordination; any path that has completed will wait until all the paths specified in the fork node reach this point. At that point, the workflow continues at the node specified within the join node. Here's how the join node is used:

<join name="create-join-node" to="gettweets-node"/>

In the preceding code, we omitted the action definitions for space purposes, but the full workflow definition is in v6:


Ingestion workflow v6

Calling a subworkflow

Though the fork/join mechanism makes the process of parallel actions more efficient, it does still add significant verbosity if we include it in our main workflow.xml definition. Conceptually, we have a series of actions that are performing related tasks required by our workflow but not necessarily part of it. For this and similar cases, Oozie offers the ability to invoke a subworkflow. The parent workflow will execute the child and wait for it to complete, with the ability to pass configuration elements from one workflow to the other.

The child workflow will be a full workflow in its own right, usually stored in a directory on HDFS with all the usual structure we expect for a workflow: the main workflow.xml file and any required Hive, Pig, or similar files.

We can create a new directory on HDFS called setup-workflow, and in this create the files required only for our filesystem and Hive creation actions. The subworkflow configuration file will look like the following:

<workflow-app xmlns="uri:oozie:workflow:0.4" name="create-workflow">
  <start to="setup-fork-node"/>
  <fork name="setup-fork-node">
    <path start="setup-filesystem-node"/>
    <path start="create-tables-node"/>
  </fork>
  <action name="setup-filesystem-node">
  </action>
  <action name="create-tables-node">
  </action>
  <join name="create-join-node" to="end"/>
  <kill name="fail">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>


With this subworkflow defined, we then modify the first nodes of our main workflow to use a subworkflow node, as in the following:

<start to="create-subworkflow-node"/>
<action name="create-subworkflow-node">
  <sub-workflow>
    <app-path>${subWorkflowRoot}</app-path>
    <propagate-configuration/>
  </sub-workflow>
  <ok to="gettweets-node"/>
  <error to="fail"/>
</action>

We will specify subWorkflowRoot in the job.properties of our parent workflow, and the propagate-configuration element will pass the configuration of the parent workflow to the child.

Adding global settings

By extracting utility nodes into subworkflows, we can significantly reduce clutter and complexity in our main workflow definition. In v7 of our ingest pipeline, we'll make one additional simplification and add a global configuration section, as in the following:

<workflow-app xmlns="uri:oozie:workflow:0.4" name="v7">
  <global>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <job-xml>${workflowRoot}/hive-site.xml</job-xml>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
    </configuration>
  </global>
  <start to="create-subworkflow-node"/>

By adding this global configuration section, we remove the need to specify any of these values in the Hive and Pig nodes in the remaining workflow (note that currently the shell node does not support the global configuration mechanism). This can dramatically simplify some of our nodes; for example, our Pig node is now as follows:

<action name="hcat-ingest-node">
  <pig>
    <configuration>
      <property>
        <name>oozie.action.sharelib.for.pig</name>
        <value>pig,hcatalog</value>
      </property>
    </configuration>
    <script>${workflowRoot}/pig/extract_to_hcat.pig</script>
    <param>inputDir=${inputDir}</param>
    <param>dbName=${dbName}</param>
    <param>partitionKey=${partitionKey}</param>
  </pig>
  <ok to="derived-data-node"/>
  <error to="fail"/>
</action>

As can be seen, we can add additional configuration elements, or indeed override those specified in the global section, resulting in a much cleaner action definition that focuses only on the information specific to the action in question. Our workflow v7 has had both a global section added as well as the addition of the subworkflow, and this makes a significant improvement in the workflow readability:

Ingestion workflow v7


Challenges of external data

When we rely on external data to drive our application, we are implicitly dependent on the quality and stability of that data. This is, of course, true for any data, but when the data is generated by an external source over which we do not have control, the risks are most likely higher. Regardless, when building what we expect to be reliable applications on top of such data feeds, and especially when our data volumes grow, we need to think about how to mitigate these risks.


Data validation

We use the general term data validation to refer to the act of ensuring that incoming data complies with our expectations and potentially applying normalization to modify it accordingly, or even deleting malformed or corrupt input. What this actually involves will be very application-specific. In some cases, the important thing is ensuring that the system only ingests data that conforms to a given definition of accurate or clean. For our tweet data, we don't care about every single record and could very easily adopt a policy such as dropping records that don't have values in particular fields we care about. For other applications, however, it is imperative to capture every input record, and this might drive the implementation of logic to reformat every record to make sure it complies with the requirements. In yet other cases, only correct records will be ingested, but the rest, instead of being discarded, might be stored elsewhere for later analysis.

The bottom line is that trying to define a generic approach to data validation is vastly beyond the scope of this chapter.

However, we can offer some thoughts on where in the pipeline to incorporate various types of validation logic.

ValidationactionsLogictodoanynecessaryvalidationorcleanupcanbeincorporateddirectlyintootheractions.Ashellnoderunningascripttogatherdatacanhavecommandsaddedtohandlemalformedrecordsdifferently.PigandHiveactionsthatloaddataintotablescaneitherperformfilteringoningest(easierdoneinPig)oraddcaveatswhencopyingdatafromaningesttabletotheoperationalstore.

There is an argument though for the addition of a validation node into the workflow, even if initially it performs no actual logic. This could, for instance, be a Pig action that reads the data, applies the validation, and writes the validated data to a new location to be read by follow-on nodes. The advantage here is that we can later update the validation logic without altering our other actions, which should reduce the risk of accidentally breaking the rest of the pipeline and also make nodes more cleanly defined in terms of responsibilities. The natural extension of this train of thought is that a new subworkflow for validation is most likely a good model as well, as it not only provides separation of responsibilities, but also makes the validation logic easier to test and update.
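As a rough illustration, a minimal pass-through validation script for such a Pig action might do no more than drop records missing mandatory fields before handing the data on; the field names, parameters, and output location here are hypothetical:

-- validate.pig: read raw tweets, drop records without an id or text,
-- and write the surviving records to a new location for follow-on nodes
raw = LOAD '$inputDir' USING JsonLoader('id_str:chararray, text:chararray, created_at:chararray');
valid = FILTER raw BY id_str IS NOT NULL AND text IS NOT NULL;
STORE valid INTO '$validatedDir';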

The obvious disadvantage of this approach is that it adds additional processing and another cycle of reading the data and writing it all again. This is, of course, directly working against one of the advantages we highlighted when considering the use of HCatalog from Pig.

In the end, it will come down to a trade-off of performance against workflow complexity and maintainability. When considering how to perform validation and just what that means for your workflow, take all these elements into account before deciding on an implementation.


Handling format changes

We can't declare victory just because we have data flowing into our system and are confident the data is sufficiently validated. Particularly when the data comes from an external source, we have to think about how the structure of the data might change over time.

Remember that systems such as Hive only apply the table schema when the data is being read. This is a huge benefit in enabling flexible data storage and ingest, but can lead to user-facing queries or workloads failing suddenly when the ingested data no longer matches the queries being executed against it. A relational database, which applies schemas on write, would not even allow such data to be ingested into the system.

The obvious approach to handling changes made to the data format would be to reprocess existing data into the new format. Though this is tractable on smaller datasets, it quickly becomes infeasible on the sort of volumes seen in large Hadoop clusters.


Handling schema evolution with Avro

Avro has some features with respect to its integration with Hive that help us with this problem. If we take our table for tweets data, we could represent the structure of a tweet record by the following Avro schema:

{
    "namespace": "com.learninghadoop2.avrotables",
    "type": "record",
    "name": "tweets_avro",
    "fields": [
        {"name": "created_at", "type": ["null", "string"]},
        {"name": "tweet_id_str", "type": ["null", "string"]},
        {"name": "text", "type": ["null", "string"]},
        {"name": "in_reply_to", "type": ["null", "string"]},
        {"name": "is_retweeted", "type": ["null", "string"]},
        {"name": "user_id", "type": ["null", "string"]},
        {"name": "place_id", "type": ["null", "string"]}
    ]
}

Create the preceding schema in a file called tweets_avro.avsc—this is the standard file extension for Avro schemas. Then, place it on HDFS; we like to have a common location for schema files such as /schema/avro.

With this definition, we can now create a Hive table that uses this schema for its table specification, as follows:

CREATE TABLE tweets_avro
PARTITIONED BY (`partition_key` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url'='hdfs://localhost.localdomain:8020/schema/avro/tweets_avro.avsc'
)
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';

Then, look at the table definition from within Hive (or HCatalog, which also supports such definitions):

describe tweets_avro;
OK
created_at      string  from deserializer
tweet_id_str    string  from deserializer
text            string  from deserializer
in_reply_to     string  from deserializer
is_retweeted    string  from deserializer
user_id         string  from deserializer
place_id        string  from deserializer
partition_key   int     None


We can also use this table like any other, for example, to copy the data from partition 3 from the non-Avro table into the Avro table, as follows:

SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE tweets_avro
PARTITION (partition_key)
SELECT * FROM tweets_hcat
WHERE partition_key = 3;

Note: Just as in previous examples, if Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before being able to select from the table.

We now have a new tweets table specified by an Avro schema; so far it just looks like other tables. But the real benefits for our purposes in this chapter are in how we can use the Avro mechanism to handle schema evolution. Let's add a new field to our table schema, as follows:

{
    "namespace": "com.learninghadoop2.avrotables",
    "type": "record",
    "name": "tweets_avro",
    "fields": [
        {"name": "created_at", "type": ["null", "string"]},
        {"name": "tweet_id_str", "type": ["null", "string"]},
        {"name": "text", "type": ["null", "string"]},
        {"name": "in_reply_to", "type": ["null", "string"]},
        {"name": "is_retweeted", "type": ["null", "string"]},
        {"name": "user_id", "type": ["null", "string"]},
        {"name": "place_id", "type": ["null", "string"]},
        {"name": "new_feature", "type": "string", "default": "wow!"}
    ]
}

With this new schema in place, we can validate that the table definition has also been updated, as follows:

describe tweets_avro;
OK
created_at      string  from deserializer
tweet_id_str    string  from deserializer
text            string  from deserializer
in_reply_to     string  from deserializer
is_retweeted    string  from deserializer
user_id         string  from deserializer
place_id        string  from deserializer
new_feature     string  from deserializer
partition_key   int     None

Without adding any new data, we can run queries on the new field that will return the default value for our existing data, as follows:

SELECT new_feature FROM tweets_avro LIMIT 5;
...
OK
wow!
wow!
wow!
wow!
wow!

Even more impressive is the fact that the new column doesn't need to be added at the end; it can be anywhere in the record. With this mechanism, we can now update our Avro schemas to represent the new data structure and see these changes automatically reflected in our Hive table definitions. Any queries that refer to the new column will retrieve the default value for all our existing data that does not have that field present.

Note that the default mechanism we are using here is core to Avro and is not specific to Hive. Avro is a very powerful and flexible format that has applications in many areas and is definitely worth deeper examination than we are giving it here.

Technically, what this provides us with is forward compatibility. We can make changes to our table schema and have all our existing data remain automatically compliant with the new structure. We can't, however, continue to ingest data of the old format into the updated tables, since the mechanism does not provide backward compatibility:

INSERT INTO TABLE tweets_avro
PARTITION (partition_key)
SELECT * FROM tweets_hcat;

FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into
target table because column number/types are different 'tweets_avro': Table
insclause-0 has 8 columns, but query has 7 columns.

Supporting schema evolution with Avro allows data changes to be something that is handled as part of normal business instead of the firefighting emergency they all too often turn into. But plainly, it's not for free; there is still a need to make the changes in the pipeline and roll these into production. Having Hive tables that provide forward compatibility does, however, allow the process to be performed in more manageable steps; otherwise, you would need to synchronize changes across every stage of the pipeline. If the changes are made from ingest up to the point they are inserted into Avro-backed Hive tables, then all users of those tables can remain unchanged (as long as they don't do things like SELECT *, which is usually a terrible idea anyway) and continue to run existing queries against the new data. These applications can then be changed on a different timetable to the ingestion mechanism. In our v8 of the ingest pipeline, we show how to fully use Avro tables for all of our existing functionality.

Note: Hive 0.14, currently unreleased at the time of writing this, will likely include more built-in support for Avro that might simplify the process of schema evolution even further. If Hive 0.14 is available when you read this, then do check out the final implementation.

Final thoughts on using Avro schema evolution

With this discussion of Avro, we have touched on some aspects of much broader topics, in particular of data management on a broader scale and policies around data versioning and retention. Much of this area becomes very specific to an organization, but here are a few parting thoughts that we feel are more broadly applicable.

Only make additive changes

We discussed adding columns in the preceding example. Sometimes, though more rarely, your source data drops columns or you discover you no longer need a column. Avro doesn't really provide tools to help with this, and we feel it is often undesirable. Instead of dropping old columns, we tend to maintain the old data and simply do not use the empty columns in all the new data. This is much easier to manage if you control the data format; if you are ingesting external sources, then to follow this approach you will either need to reprocess data to remove the old columns or change the ingest mechanism to add a default value for all new data.

Manage schema versions explicitly

In the preceding examples, we had a single schema file to which we made changes directly. This is likely a very bad idea, as it removes our ability to track schema changes over time. In addition to treating schemas as artifacts to be kept under version control (your schemas are in Git too, aren't they?), it is often useful to tag each schema with an explicit version. This is particularly useful when the incoming data is also explicitly versioned. Then, instead of overwriting the existing schema file, you can add the new file and use an ALTER TABLE statement to point the Hive table definition at the new schema. We are, of course, assuming here that you don't have the option of using a different query for the old data with the different format. Though there is no automatic mechanism for Hive to select schema, there might be cases where you can control this manually and sidestep the evolution question.

Think about schema distribution

When using a schema file, think about how it will be distributed to the clients. If, as in the previous example, the file is on HDFS, then it likely makes sense to give it a high replication factor. The file will be retrieved by each mapper in every MapReduce job that queries the table.

The Avro URL can also be specified as a local filesystem location (file://), which is useful for development, and also as a web resource (http://). Though the latter is very useful as it is a convenient mechanism to distribute the schema to non-Hadoop clients, remember that the load on the web server might be high. With modern hardware and efficient web servers, this is most likely not a huge concern, but if you have a cluster of thousands of machines running many parallel jobs where each mapper needs to hit the web server, then be careful.


Collecting additional data

Many data processing systems don't have a single data ingest source; often, one primary source is enriched by other secondary sources. We will now look at how to incorporate the retrieval of such reference data into our data warehouse.

At a high level, the problem isn't very different from our retrieval of the raw tweet data, as we wish to pull data from an external source, possibly do some processing on it, and store it somewhere where it can be used later. But this does highlight an aspect we need to consider: do we really want to retrieve this data every time we ingest new tweets? The answer is certainly no. The reference data changes very rarely, and we could easily fetch it much less frequently than new tweet data. This raises a question we've skirted until now: just how do we schedule Oozie workflows?


Scheduling workflows

Until now, we've run all our Oozie workflows on demand from the CLI. Oozie also has a scheduler that allows jobs to be started either on a timed basis or when external criteria, such as data appearing in HDFS, are met. It would be a good fit for our workflows to have our main tweet pipeline run, say, every 10 minutes but the reference data only refreshed daily.

Tip: Regardless of when data is retrieved, think carefully how to handle datasets that perform a delete/replace operation. In particular, don't do the delete before retrieving and validating the new data; otherwise, any jobs that require the reference data will fail until the next run of the retrieval succeeds. It could be a good option to include the destructive operations in a subworkflow that is only triggered after successful completion of the retrieval steps.

Oozie actually defines two types of applications that it can run: workflows such as we've used so far, and coordinators, which schedule workflows to be executed based on various criteria. A coordinator job is conceptually similar to our other workflows; we push an XML configuration file onto HDFS and use a parameterized properties file to configure it at runtime. In addition, coordinator jobs have the facility to receive additional parameterization from the events that trigger their execution.

This is possibly best described by an example. Let's say we wish to do as previously mentioned and create a coordinator that executes v7 of our ingest workflow every 10 minutes. Here's the coordinator.xml file (the standard name for the coordinator XML definition):

<coordinator-app name="tweets-10min-coordinator" frequency="${freq}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">

The main action node in a coordinator is the workflow, for which we need to specify its root location on HDFS and all required properties, as follows:

<action>
    <workflow>
        <app-path>${workflowPath}</app-path>
        <configuration>
            <property>
                <name>workflowRoot</name>
                <value>${workflowRoot}</value>
            </property>

We also need to include any properties required by any action in the workflow or by any subworkflow it triggers; in effect, this means that any user-defined variables present in any of the workflows to be triggered need to be included here, as follows:

            <property>
                <name>dbName</name>
                <value>${dbName}</value>
            </property>
            <property>
                <name>partitionKey</name>
                <value>${coord:formatTime(coord:nominalTime(), 'yyyyMMddhhmm')}</value>
            </property>
            <property>
                <name>exec</name>
                <value>gettweets.sh</value>
            </property>
            <property>
                <name>inputDir</name>
                <value>/tmp/tweets</value>
            </property>
            <property>
                <name>subWorkflowRoot</name>
                <value>${subWorkflowRoot}</value>
            </property>
        </configuration>
    </workflow>
</action>
</coordinator-app>

We used a few coordinator-specific features in the preceding XML. Note the specification of the starting and ending time of the coordinator and also its frequency (in minutes). We are using the simplest form here; Oozie also has a set of functions to allow quite rich specifications of the frequency.

We use coordinator EL functions in our definition of the partitionKey variable. Earlier, when running workflows from the CLI, we specified these explicitly but mentioned there was a better way—this is it. The following expression generates a formatted output containing the year, month, day, hour, and minute:

${coord:formatTime(coord:nominalTime(), 'yyyyMMddhhmm')}

If we then use this as the value for our partition key, we can ensure that each invocation of the workflow correctly creates a unique partition in our HCatalog tables.

The corresponding job.properties for the coordinator job looks much like our previous config files, with the usual entries for the NameNode and similar variables, as well as having values for the application-specific variables, such as dbName. In addition, we need to specify the root of the coordinator location on HDFS, as follows:

oozie.coord.application.path=${nameNode}/user/${user.name}/${tasksRoot}/tweets_10min

Note the oozie.coord namespace prefix instead of the previously used oozie.wf. With the coordinator definition on HDFS, we can submit the file to Oozie just as with the previous jobs. But in this case, the job will only run for a given time period. Specifically, it will run at the configured frequency (set via the freq variable, every 10 minutes in our example) whenever the system clock is between startTime and endTime.
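Submission uses the same Oozie CLI as before; a typical invocation, assuming the properties file shown above, would look something like this (the Oozie server URL is the QuickStart VM default and may differ in your environment):

$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run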

We've included the full configuration in the tweets_10min directory in the source code for this chapter.


Other Oozie triggers

The preceding coordinator has a very simple trigger; it starts periodically within a specified time range. Oozie has an additional capability called datasets, whereby it can be triggered by the availability of new data.

This isn't a great fit for how we've defined our pipeline until now, but imagine that, instead of our workflow collecting tweets as its first step, an external system was pushing new files of tweets onto HDFS on a continuous basis. Oozie can be configured to either look for the presence of new data based on a directory pattern or to specifically trigger when a ready file appears on HDFS. This latter configuration provides a very convenient mechanism with which to integrate the output of MapReduce jobs, which, by default, write a _SUCCESS file into their output directory.
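To give a flavour of what this looks like, a coordinator can declare a dataset along the following lines; the URI template and frequency here are illustrative values, not part of our pipeline:

<datasets>
    <dataset name="raw-tweets" frequency="${coord:minutes(10)}"
             initial-instance="${startTime}" timezone="UTC">
        <!-- directory pattern Oozie polls for new data -->
        <uri-template>${nameNode}/incoming/tweets/${YEAR}${MONTH}${DAY}${HOUR}${MINUTE}</uri-template>
        <!-- only trigger once the _SUCCESS marker has been written -->
        <done-flag>_SUCCESS</done-flag>
    </dataset>
</datasets>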

Oozie datasets are arguably one of the most powerful parts of the whole system, and we cannot do them justice here for space reasons. But we do strongly recommend that you consult the Oozie home page for more information.


Pulling it all together

Let's review what we've discussed until now and how we can use Oozie to build a sophisticated series of workflows that implement an approach to data lifecycle management by putting together all the discussed techniques.

First, it's important to define clear responsibilities and implement parts of the system using good design and separation of concern principles. By applying this, we end up with several different workflows:

- A subworkflow to ensure the environment (mainly HDFS and Hive metadata) is correctly configured
- A subworkflow to perform data validation
- The main workflow that triggers both the preceding subworkflows and then pulls new data through a multistep ingest pipeline
- A coordinator that executes the preceding workflows every 10 minutes
- A second coordinator that ingests reference data that will be useful to the application pipeline

We also define all our tables with Avro schemas and use them wherever possible to help manage schema evolution and changing data formats over time.

We present the full source code of these components in the final version of the workflow in the source code of this chapter.


Other tools to help

Though Oozie is a very powerful tool, sometimes it can be somewhat difficult to correctly write workflow definition files. As pipelines get sizeable, managing complexity becomes a challenge even with good functional partitioning into multiple workflows. At a simpler level, XML is just never fun for a human to write! There are a few tools that can help. Hue, the tool calling itself the Hadoop UI (http://gethue.com/), provides some graphical tools to help compose, execute, and manage Oozie workflows. Though powerful, Hue is not a beginner tool; we'll mention it a little more in Chapter 11, Where to Go Next.

A new Apache project called Falcon (http://falcon.incubator.apache.org) might also be of interest. Falcon uses Oozie to build a range of much higher-level dataflows and actions. For example, Falcon provides recipes to enable and ensure cross-site replication across multiple Hadoop clusters. The Falcon team is working on much better interfaces to build their workflows, so the project might well be worth watching.


Summary

Hopefully, this chapter presented the topic of data lifecycle management as something other than a dry abstract concept. We covered a lot, particularly:

- The definition of data lifecycle management and how it covers a number of issues and techniques that usually become important with large data volumes
- The concept of building a data ingest pipeline along good data lifecycle management principles that can then be utilized by higher-level analytic tools
- Oozie as a Hadoop-focused workflow manager and how we can use it to compose a series of actions into a unified workflow
- Various Oozie tools, such as subworkflows, parallel action execution, and global variables, that allow us to apply true design principles to our workflows
- HCatalog and how it provides the means for tools other than Hive to read and write table-structured data; we showed its great promise and integration with tools such as Pig but also highlighted some current weaknesses
- Avro as our tool of choice to handle schema evolution over time
- Using Oozie coordinators to build scheduled workflows based either on time intervals or data availability to drive the execution of multiple ingest pipelines
- Some other tools that can make these tasks easier, namely, Hue and Falcon

In the next chapter, we'll look at several of the higher-level analytic tools and frameworks that can build sophisticated application logic upon the data collected in an ingest pipeline.


Chapter 9. Making Development Easier

In this chapter, we will look at how, depending on use cases and end goals, application development in Hadoop can be simplified using a number of abstractions and frameworks built on top of the Java APIs. In particular, we will learn about the following topics:

- How the streaming API allows us to write MapReduce jobs using dynamic languages such as Python and Ruby
- How frameworks such as Apache Crunch and Kite Morphlines allow us to express data transformation pipelines using higher-level abstractions
- How Kite Data, a promising framework developed by Cloudera, provides us with the ability to apply design patterns and boilerplate to ease integration and interoperability of different components within the Hadoop ecosystem


Choosing a framework

In the previous chapters, we looked at the MapReduce and Spark programming APIs to write distributed applications. Although very powerful and flexible, these APIs come with a certain level of complexity and possibly require significant development time.

In an effort to reduce verbosity, we introduced the Pig and Hive frameworks, which compile domain-specific languages, Pig Latin and HiveQL, into a number of MapReduce jobs or Spark DAGs, effectively abstracting the APIs away. Both languages can be extended with UDFs, which is a way of mapping complex logic to the Pig and Hive data models.

At times when we need a certain degree of flexibility and modularity, things can get tricky. Depending on the use case and developer needs, the Hadoop ecosystem presents a vast choice of APIs, frameworks, and libraries. In this chapter, we identify four categories of users and match them with the following relevant tools:

- Developers that want to avoid Java in favor of scripting MapReduce jobs using dynamic languages, or use languages not implemented on the JVM. A typical use case would be upfront analysis and rapid prototyping: Hadoop streaming
- Java developers that need to integrate components of the Hadoop ecosystem and could benefit from codified design patterns and boilerplate: Kite Data
- Java developers who want to write modular data pipelines using a familiar API: Apache Crunch
- Developers who would rather configure chains of data transformations. For instance, a data engineer that wants to embed existing code in an ETL pipeline: Kite Morphlines


Hadoop streaming

We have mentioned previously that MapReduce programs don't have to be written in Java. There are several reasons why you might want or need to write your map and reduce tasks in another language. Perhaps you have existing code to leverage or need to use third-party binaries—the reasons are varied and valid.

Hadoop provides a number of mechanisms to aid non-Java development, primary amongst which are Hadoop pipes, which provide a native C++ interface, and Hadoop streaming, which allows any program that uses standard input and output to be used for map and reduce tasks. With the MapReduce Java API, both map and reduce tasks provide implementations for methods that contain the task functionality. These methods receive the input to the task as method arguments and then output results via the Context object. This is a clear and type-safe interface, but it is by definition Java-specific.

Hadoop streaming takes a different approach. With streaming, you write a map task that reads its input from standard input, one line at a time, and gives the output of its results to standard output. The reduce task then does the same, again using only standard input and output for its data flow.

Any program that reads and writes from standard input and output can be used in streaming, such as compiled binaries, Unix shell scripts, or programs written in a dynamic language such as Python or Ruby. The biggest advantage to streaming is that it can allow you to try ideas and iterate on them more quickly than using Java. Instead of a compile/JAR/submit cycle, you just write the scripts and pass them as arguments to the streaming JAR file. Especially when doing initial analysis on a new dataset or trying out new ideas, this can significantly speed up development.

The classic debate regarding dynamic versus static languages balances the benefits of swift development against runtime performance and type checking. These dynamic downsides also apply when using streaming. Consequently, we favor the use of streaming for upfront analysis and Java for the implementation of jobs that will be executed on the production cluster.


Streaming word count in Python

We'll demonstrate Hadoop streaming by re-implementing our familiar word count example using Python. First, we create a script that will be our mapper. It consumes UTF-8 encoded rows of text from standard input with a for loop, splits this into words, and uses the print function to write each word to standard output, as follows:

#!/bin/env python
import sys

for line in sys.stdin:
    # skip empty lines
    if line == '\n':
        continue

    # preserve utf-8 encoding
    try:
        line = line.encode('utf-8')
    except UnicodeDecodeError:
        continue

    # newline characters can appear within the text
    line = line.replace('\n', '')

    # lowercase and tokenize
    line = line.lower().split()

    for term in line:
        if not term:
            continue
        try:
            print(
                u"%s" % (
                    term.decode('utf-8')))
        except UnicodeEncodeError:
            continue

The reducer counts the number of occurrences of each word from standard input, and gives the output as the final value to standard output, as follows:

#!/bin/env python
import sys

count = 1
current = None

for word in sys.stdin:
    word = word.strip()
    if word == current:
        count += 1
    else:
        if current:
            print "%s\t%s" % (current.decode('utf-8'), count)
        current = word
        count = 1

if current == word:
    print "%s\t%s" % (current.decode('utf-8'), count)

Note: In both cases, we are implicitly using the Hadoop input and output formats discussed in the earlier chapters. It is the TextInputFormat that processes the source file and provides each line one at a time to the map script. Conversely, the TextOutputFormat will ensure that the output of reduce tasks is also correctly written as text.

Copy map.py and reduce.py to HDFS, and execute the scripts as a streaming job using the sample data from the previous chapters, as follows:

$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -file map.py \
    -mapper "python map.py" \
    -file reduce.py \
    -reducer "python reduce.py" \
    -input sample.txt \
    -output output.txt

Note: Tweets are UTF-8 encoded. Make sure that PYTHONIOENCODING is set accordingly in order to pipe data in a UNIX terminal:

$ export PYTHONIOENCODING='UTF-8'

The same code can be executed from the command-line prompt (note that a sort step between the mapper and the reducer takes the place of the shuffle performed by Hadoop):

$ cat sample.txt | python map.py | sort | python reduce.py > out.txt

The mapper and reducer code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/wc/python/map.py.


Differences in jobs when using streaming

In Java, we know that our map() method will be invoked once for each input key/value pair and our reduce() method will be invoked for each key and its set of values.

With streaming, we don't have the concept of the map or reduce methods anymore; instead, we have written scripts that process streams of received data. This changes how we need to write our reducer. In Java, the grouping of values to each key was performed by Hadoop; each invocation of the reduce method would receive a single, tab-separated key and all its values. In streaming, each instance of the reduce task is given the individual ungathered values one at a time.

Hadoop streaming does sort the keys; for example, if a mapper emitted the following data:

First   1
Word    1
Word    1
A       1
First   1

The streaming reducer would receive it in the following order:

A       1
First   1
First   1
Word    1
Word    1

Hadoop still collects the values for each key and ensures that each key is passed only to a single reducer. In other words, a reducer gets all the values for a number of keys, and they are grouped together; however, they are not packaged into individual executions of the reducer, that is, one per key, as with the Java API. Since Hadoop streaming uses the stdin and stdout channels to exchange data between tasks, debug and error messages should not be printed to standard output. In the following example, we will use the Python logging (https://docs.python.org/2/library/logging.html) package to log warning statements to a file.
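The later scripts refer to a logger object without showing its setup; a minimal configuration along the following lines would be enough (the log filename is arbitrary):

import logging

# Write warnings and above to a local file rather than stdout/stderr,
# so they do not interfere with the streaming data flow.
logging.basicConfig(filename='mapreduce.log', level=logging.WARN)
logger = logging.getLogger(__name__)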


Finding important words in text

We will now implement a metric, Term Frequency-Inverse Document Frequency (TF-IDF), that will help us to determine the importance of words based on how frequently they appear across a set of documents (tweets, in our case).

Intuitively, if a word appears frequently in a document, it is important and should be given a high score. However, if a word appears in many documents, we should penalize it with a lower score, as it is a common word and its frequency is not unique to this document.

Therefore, common words such as the and for, which appear in many documents, will be scaled down. Words that appear frequently in a single tweet will be scaled up. Uses of TF-IDF, often in combination with other metrics and techniques, include stop word removal and text classification. Note that this technique will have shortcomings when dealing with short documents, such as tweets. In such cases, the term frequency component will tend to become one. Conversely, one could exploit this property to detect outliers.

The definition of TF-IDF we will use in our example is the following:

tf = # of times term appears in a document (raw frequency)
idf = 1 + log(# of documents / # documents with term in it)
tf-idf = tf * idf

We will implement the algorithm in Python using three MapReduce jobs:

- The first one calculates term frequency
- The second one calculates document frequency (the denominator of IDF)
- The third one calculates per-tweet TF-IDF

Calculate term frequency

The term frequency part is very similar to the word count example. The main difference is that we will be using a multi-field, tab-separated key to keep track of co-occurrences of terms and document IDs. For each tweet—in JSON format—the mapper extracts the id_str and text fields, tokenizes text, and emits a (term, doc_id) tuple:

for tweet in sys.stdin:
    # skip empty lines
    if tweet == '\n':
        continue

    try:
        tweet = json.loads(tweet)
    except:
        logger.warn("Invalid input %s" % tweet)
        continue

    # In our example one tweet corresponds to one document.
    doc_id = tweet['id_str']
    if not doc_id:
        continue

    # preserve utf-8 encoding
    text = tweet['text'].encode('utf-8')

    # newline characters can appear within the text
    text = text.replace('\n', '')

    # lowercase and tokenize
    text = text.lower().split()

    for term in text:
        try:
            print(
                u"%s\t%s" % (
                    term.decode('utf-8'), doc_id.decode('utf-8'))
            )
        except UnicodeEncodeError:
            logger.warn("Invalid term %s" % term)

In the reducer, we emit the frequency of each term in a document as a tab-separated string:

freq = 1
cur_term, cur_doc_id = sys.stdin.readline().split()

for line in sys.stdin:
    line = line.strip()
    try:
        term, doc_id = line.split('\t')
    except:
        logger.warn("Invalid record %s" % line)

    # the key is a (doc_id, term) pair
    if (doc_id == cur_doc_id) and (term == cur_term):
        freq += 1
    else:
        print(
            u"%s\t%s\t%s" % (
                cur_term.decode('utf-8'), cur_doc_id.decode('utf-8'),
                freq))
        cur_doc_id = doc_id
        cur_term = term
        freq = 1

print(
    u"%s\t%s\t%s" % (
        cur_term.decode('utf-8'), cur_doc_id.decode('utf-8'), freq))

For this implementation to work, it is crucial that the reducer input is sorted by term. We can test both scripts from the command line with the following pipe:

$ cat tweets.json | python map-tf.py | sort -k1,2 | \
    python reduce-tf.py

Whereas at the command line we use the sort utility, in MapReduce we will use org.apache.hadoop.mapreduce.lib.KeyFieldBasedComparator. This comparator implements a subset of the features provided by the sort command. In particular, ordering by field can be specified with the -k<position> option. To order by term, the first field of our key, we set -D mapreduce.text.key.comparator.options=-k1:

/usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D map.output.key.field.separator=\t \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.output.key.comparator.class=\
org.apache.hadoop.mapreduce.lib.KeyFieldBasedComparator \
    -D mapreduce.text.key.comparator.options=-k1,2 \
    -input tweets.json \
    -output /tmp/tf-out.tsv \
    -file map-tf.py \
    -mapper "python map-tf.py" \
    -file reduce-tf.py \
    -reducer "python reduce-tf.py"

Note: We specify which fields belong to the key (for shuffling) in the comparator options.

The mapper and reducer code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/tf-idf/python/map-tf.py.

Calculate document frequency

The main logic to calculate document frequency is in the reducer, while the mapper is just an identity function that loads and pipes the (ordered by term) output of the TF job. In the reducer, for each term, we count how many times it occurs across all documents. For each term, we keep a buffer key_cache of (term, doc_id, tf) tuples, and when a new term is found we flush the buffer to standard output, together with the accumulated document frequency df:

# Cache the (term, doc_id, tf) tuple.
key_cache = []

line = sys.stdin.readline().strip()
cur_term, cur_doc_id, cur_tf = line.split('\t')
cur_tf = int(cur_tf)
cur_df = 1

for line in sys.stdin:
    line = line.strip()
    try:
        term, doc_id, tf = line.strip().split('\t')
        tf = int(tf)
    except:
        logger.warn("Invalid record: %s" % line)
        continue

    # term is the only key for this input
    if (term == cur_term):
        # increment document frequency
        cur_df += 1
        key_cache.append(
            u"%s\t%s\t%s" % (term.decode('utf-8'), doc_id.decode('utf-8'), tf))
    else:
        for key in key_cache:
            print("%s\t%s" % (key, cur_df))
        print(
            u"%s\t%s\t%s\t%s" % (
                cur_term.decode('utf-8'),
                cur_doc_id.decode('utf-8'),
                cur_tf, cur_df)
        )
        # flush the cache
        key_cache = []
        cur_doc_id = doc_id
        cur_term = term
        cur_tf = tf
        cur_df = 1

for key in key_cache:
    print(u"%s\t%s" % (key.decode('utf-8'), cur_df))
print(
    u"%s\t%s\t%s\t%s\n" % (
        cur_term.decode('utf-8'),
        cur_doc_id.decode('utf-8'),
        cur_tf, cur_df))

We can test the scripts from the command line with:

$ cat /tmp/tf-out.tsv | python map-df.py | python reduce-df.py > /tmp/df-out.tsv

And we can run the scripts on Hadoop streaming with:

/usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D map.output.key.field.separator=\t \
    -D stream.num.map.output.key.fields=3 \
    -D mapreduce.output.key.comparator.class=\
org.apache.hadoop.mapreduce.lib.KeyFieldBasedComparator \
    -D mapreduce.text.key.comparator.options=-k1 \
    -input /tmp/tf-out.tsv/part-00000 \
    -output /tmp/df-out.tsv \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -file reduce-df.py \
    -reducer "python reduce-df.py"

On Hadoop, we use org.apache.hadoop.mapred.lib.IdentityMapper, which provides the same logic as the map-df.py script.

The mapper and reducer code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/tf-idf/python/map-df.py.

Putting it all together – TF-IDF

To calculate TF-IDF, we only need a mapper that consumes the output of the previous step:

num_doc = sys.argv[1]

for line in sys.stdin:
    line = line.strip()
    try:
        term, doc_id, tf, df = line.split('\t')
        tf = float(tf)
        df = float(df)
        num_doc = float(num_doc)
    except:
        logger.warn("Invalid record %s" % line)

    # idf = num_doc / df
    tf_idf = tf * (1 + math.log(num_doc / df))
    print("%s\t%s\t%s" % (term, doc_id, tf_idf))

The number of documents in the collection is passed as a parameter to tf-idf.py:

/usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapreduce.reduce.tasks=0 \
    -input /tmp/df-out.tsv/part-00000 \
    -output /tmp/tf-idf.out \
    -file tf-idf.py \
    -mapper "python tf-idf.py 15578"

To calculate the total number of tweets, we can use the cat and wc Unix utilities in combination with Hadoop streaming:

/usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input tweets.json \
    -output tweets.cnt \
    -mapper /bin/cat \
    -reducer /usr/bin/wc

The mapper source code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/tf-idf/python/tf-idf.py.


Kite Data

The Kite SDK (http://www.kitesdk.org) is a collection of classes, command-line tools, and examples that aims at easing the process of building applications on top of Hadoop.

In this section, we will look at how Kite Data, a subproject of Kite, can ease integration with several components of a Hadoop data warehouse. Kite examples can be found at https://github.com/kite-sdk/kite-examples.

On Cloudera's QuickStart VM, Kite JARs can be found at /opt/cloudera/parcels/CDH/lib/kite/.

Kite Data is organized in a number of subprojects, some of which we'll describe in the following sections.


Data Core

As the name suggests, the core is the building block for all capabilities provided in the Data module. Its principal abstractions are datasets and repositories.

The org.kitesdk.data.Dataset interface is used to represent an immutable set of data:

@Immutable
public interface Dataset<E> extends RefinableView<E> {
    String getName();
    DatasetDescriptor getDescriptor();
    Dataset<E> getPartition(PartitionKey key, boolean autoCreate);
    void dropPartition(PartitionKey key);
    Iterable<Dataset<E>> getPartitions();
    URI getUri();
}

Each dataset is identified by a name and an instance of the org.kitesdk.data.DatasetDescriptor interface, that is, the structural description of a dataset, which provides its schema (org.apache.avro.Schema) and partitioning strategy.

Implementations of the DatasetReader<E> interface are used to read data from an underlying storage system and produce deserialized entities of type E. The newReader() method can be used to get an appropriate implementation for a given dataset:

public interface DatasetReader<E> extends Iterator<E>, Iterable<E>, Closeable {
    void open();
    boolean hasNext();
    E next();
    void remove();
    void close();
    boolean isOpen();
}

An instance of DatasetReader will provide methods to read and iterate over streams of data. Similarly, org.kitesdk.data.DatasetWriter provides an interface to write streams of data to Dataset objects:

public interface DatasetWriter<E> extends Flushable, Closeable {
    void open();
    void write(E entity);
    void flush();
    void close();
    boolean isOpen();
}

Like readers, writers are use-once objects. They serialize instances of entities of type E and write them to the underlying storage system. Writers are usually not instantiated directly; rather, an appropriate implementation can be created by the newWriter() factory method. Implementations of DatasetWriter will hold resources until close() is called and expect the caller to invoke close() in a finally block when the writer is no longer in use. Finally, note that implementations of DatasetWriter are typically not thread-safe. The behavior of a writer being accessed from multiple threads is undefined.
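To make those life cycles concrete, here is a rough fragment only, not a complete program; it assumes a Dataset<GenericRecord> named tweets has already been obtained from a repository, a GenericRecord named record has been built elsewhere, and it sticks to the methods shown in the interfaces above:

// Read every entity in the dataset; close the reader when done.
DatasetReader<GenericRecord> reader = tweets.newReader();
try {
    for (GenericRecord tweet : reader) {
        System.out.println(tweet.get("text"));
    }
} finally {
    reader.close();
}

// Append a single entity, making sure the writer is closed afterwards.
DatasetWriter<GenericRecord> writer = tweets.newWriter();
try {
    writer.write(record);
} finally {
    writer.close();
}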

A particular case of a dataset is the View interface, which is as follows:

public interface View<E> {
    Dataset<E> getDataset();
    DatasetReader<E> newReader();
    DatasetWriter<E> newWriter();
    boolean includes(E entity);
    public boolean deleteAll();
}

Views carry subsets of the keys and partitions of an existing dataset; they are conceptually similar to the notion of "view" in the relational model.

A View interface can be created from ranges of data, or ranges of keys, or as a union between other views.


Data HCatalog

Data HCatalog is a module that enables the accessing of HCatalog repositories. The core abstractions of this module are org.kitesdk.data.hcatalog.HCatalogAbstractDatasetRepository and its concrete implementation, org.kitesdk.data.hcatalog.HCatalogDatasetRepository.

They describe a DatasetRepository that uses HCatalog to manage metadata and HDFS for storage, as follows:

public class HCatalogDatasetRepository extends HCatalogAbstractDatasetRepository {

    HCatalogDatasetRepository(Configuration conf) {
        super(conf, new HCatalogManagedMetadataProvider(conf));
    }

    HCatalogDatasetRepository(Configuration conf, MetadataProvider provider) {
        super(conf, provider);
    }

    public <E> Dataset<E> create(String name, DatasetDescriptor descriptor) {
        getMetadataProvider().create(name, descriptor);
        return load(name);
    }

    public boolean delete(String name) {
        return getMetadataProvider().delete(name);
    }

    public static class Builder {
    }
}

Note: As of Kite 0.17, Data HCatalog is deprecated in favor of the new Data Hive module.

The location of the data directory is either chosen by Hive/HCatalog (so-called "managed tables"), or specified when creating an instance of this class by providing a filesystem and a root directory in the constructor (external tables).


Data Hive

The kite-data-hive module exposes Hive schemas via the Dataset interface. As of Kite 0.17, this package supersedes Data HCatalog.


Data MapReduce

The org.kitesdk.data.mapreduce package provides interfaces to read and write data to and from a Dataset with MapReduce.


Data Spark

The org.kitesdk.data.spark package provides interfaces for reading and writing data to and from a Dataset with Apache Spark.


Data Crunch

The org.kitesdk.data.crunch.CrunchDatasets class is a helper that exposes datasets and views as Crunch ReadableSource or Target classes:

public class CrunchDatasets {

    public static <E> ReadableSource<E> asSource(View<E> view, Class<E> type) {
        return new DatasetSourceTarget<E>(view, type);
    }

    public static <E> ReadableSource<E> asSource(URI uri, Class<E> type) {
        return new DatasetSourceTarget<E>(uri, type);
    }

    public static <E> ReadableSource<E> asSource(String uri, Class<E> type) {
        return asSource(URI.create(uri), type);
    }

    public static <E> Target asTarget(View<E> view) {
        return new DatasetTarget<E>(view);
    }

    public static Target asTarget(String uri) {
        return asTarget(URI.create(uri));
    }

    public static Target asTarget(URI uri) {
        return new DatasetTarget<Object>(uri);
    }
}


Apache Crunch

Apache Crunch (http://crunch.apache.org) is a Java and Scala library to create pipelines of MapReduce jobs. It is based on Google's FlumeJava (http://dl.acm.org/citation.cfm?id=1806638) paper and library. The project goal is to make the task of writing MapReduce jobs as straightforward as possible for anybody familiar with the Java programming language by exposing a number of patterns that implement operations such as aggregating, joining, filtering, and sorting records.

Similar to tools such as Pig, Crunch pipelines are created by composing immutable, distributed data structures and running all processing operations on such structures; the operations are expressed and implemented as user-defined functions. Pipelines are compiled into a DAG of MapReduce jobs, whose execution is managed by the library's planner. Crunch allows us to write iterative code and abstracts away the complexity of thinking in terms of map and reduce operations, while at the same time avoiding the need of an ad hoc programming language such as Pig Latin. In addition, Crunch offers a highly customizable type system that allows us to work with, and mix, Hadoop Writables, HBase, and Avro serialized objects.

FlumeJava's main assumption is that MapReduce is the wrong level of abstraction for several classes of problems, where computations are often made up of multiple, chained jobs. Frequently, we need to compose logically independent operations (for example, filtering, projecting, grouping, and other transformations) into a single physical MapReduce job for performance reasons. This aspect also has implications for code testability. Although we won't cover this aspect in this chapter, the reader is encouraged to look further into it by consulting Crunch's documentation.


Getting started

Crunch JARs are already installed on the QuickStart VM. By default, the JARs are found in /opt/cloudera/parcels/CDH/lib/crunch.

Alternatively, recent Crunch libraries can be downloaded from https://crunch.apache.org/download.html, from Maven Central, or from Cloudera-specific repositories.


Concepts

Crunch pipelines are created by composing two abstractions: PCollection and PTable.

The PCollection<T> interface is a distributed, immutable collection of objects of type T. The PTable<Key, Value> interface is a distributed, immutable hash table—a sub-interface of PCollection—of keys of the Key type and values of the Value type that exposes methods to work with the key-value pairs.

These two abstractions support the following four primitive operations:

- parallelDo: applies a user-defined function, DoFn, to a given PCollection and returns a new PCollection
- union: merges two or more PCollections into a single virtual PCollection
- groupByKey: sorts and groups the elements of a PTable by their keys
- combineValues: aggregates the values from a groupByKey operation

The HashtagCount class at https://github.com/learninghadoop2/book-examples/blob/master/ch9/crunch/src/main/java/com/learninghadoop2/crunch/HashtagCount.java implements a Crunch MapReduce pipeline that counts hashtag occurrences:

Pipeline pipeline = new MRPipeline(HashtagCount.class, getConf());
pipeline.enableDebug();

PCollection<String> lines = pipeline.readTextFile(args[0]);

PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
    public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
            if (word.matches("(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)")) {
                emitter.emit(word);
            }
        }
    }
}, Writables.strings());

PTable<String, Long> counts = words.count();

pipeline.writeTextFile(counts, args[1]);

// Execute the pipeline as a MapReduce.
pipeline.done();

In this example, we create an MRPipeline and use it to read the content of sample.txt, created with stream.py -t, into a collection of strings, where each element of the collection represents a tweet. We tokenize each tweet into words with line.split("\\s+"), and we emit each word that matches the hashtag regular expression, serialized as a Writable. Note that the tokenizing and filtering operations are executed in parallel by the MapReduce jobs created by the parallelDo call. We create a PTable that associates each hashtag, represented as a string, with the number of times it occurred in the dataset. Finally, we write the PTable counts into HDFS as a text file. The pipeline is executed with pipeline.done().

To compile and execute the pipeline, we can use Gradle to manage the needed dependencies, as follows:

$ ./gradlew jar
$ ./gradlew copyJars

Add the Crunch and Avro dependencies downloaded with copyJars to the LIBJARS environment variable:

$ export CRUNCH_DEPS=build/libjars/crunch-example/lib
$ export LIBJARS=${LIBJARS},${CRUNCH_DEPS}/crunch-core-0.9.0-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-1.7.5-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-mapred-1.7.5-cdh5.0.3-hadoop2.jar

Then, run the example on Hadoop:

$ hadoop jar build/libs/crunch-example.jar \
    com.learninghadoop2.crunch.HashtagCount \
    tweets.json count-out \
    -libjars $LIBJARS


Data serialization

One of the framework's goals is to make it easy to process complex records containing nested and repeated data structures, such as protocol buffers and Thrift records.

The org.apache.crunch.types.PType interface defines the mapping between a data type that is used in a Crunch pipeline and a serialization and storage format that is used to read/write data from/to HDFS. Every PCollection has an associated PType that tells Crunch how to read/write data.

The org.apache.crunch.types.PTypeFamily interface provides an abstract factory to implement instances of PType that share the same serialization format. Currently, Crunch supports two type families: one based on the Writable interface and the other on Apache Avro.

Note: Although Crunch permits mixing and matching PCollection interfaces that use different instances of PType in the same pipeline, each PCollection interface's PType must belong to a unique family. For instance, it is not possible to have a PTable with a key serialized as Writable and its value serialized using Avro.

Both type families support a common set of primitive types (strings, longs, integers, floats, doubles, booleans, and bytes) as well as more complex PType interfaces that can be constructed out of other PTypes. These include tuples and collections of other PTypes. A particularly important complex PType is tableOf, which determines whether the return type of parallelDo will be a PCollection or a PTable.

New PTypes can be created by inheriting and extending the built-ins of the Avro and Writable families. This requires implementing inputMapFn<S, T> and outputMapFn<T, S> classes. We are implementing PType for instances where S is the original type and T is the new type.

Derived PTypes can be found in the PTypes class. These include serialization support for protocol buffers, Thrift records, Java Enums, BigInteger, and UUIDs. The Elephant Bird library we discussed in Chapter 6, Data Analysis with Apache Pig, contains additional examples.


Data processing patterns

org.apache.crunch.lib implements a number of design patterns for common data manipulation operations.

Aggregation and sorting

Most of the data processing patterns provided by org.apache.crunch.lib rely on PTable's groupByKey method. The method has three different overloaded forms:

- groupByKey(): lets the planner determine the number of partitions
- groupByKey(int numPartitions): is used to set the number of partitions specified by the developer
- groupByKey(GroupingOptions options): allows us to specify custom partitions and comparators for shuffling

The org.apache.crunch.GroupingOptions class takes instances of Hadoop's Partitioner and RawComparator classes to implement custom partitioning and sorting operations.

The groupByKey method returns an instance of PGroupedTable, Crunch's representation of a grouped table. It corresponds to the output of the shuffle phase of a MapReduce job and allows values to be combined with the combineValues method.

The org.apache.crunch.lib.Aggregate package exposes methods to perform simple aggregations (count, max, top, and length) on PCollection instances.

Sort provides an API to sort PCollection and PTable instances whose contents implement the Comparable interface.

By default, Crunch sorts data using one reducer. This behavior can be modified by passing the number of partitions required to the sort method. The Sort.Order argument signals the order in which a sort should be done.

The following are how different sort options can be specified for collections:

public static <T> PCollection<T> sort(PCollection<T> collection)
public static <T> PCollection<T> sort(PCollection<T> collection, Sort.Order order)
public static <T> PCollection<T> sort(PCollection<T> collection, int numReducers, Sort.Order order)

The following are how different sort options can be specified for tables:

public static <K, V> PTable<K, V> sort(PTable<K, V> table)
public static <K, V> PTable<K, V> sort(PTable<K, V> table, Sort.Order key)
public static <K, V> PTable<K, V> sort(PTable<K, V> table, int numReducers, Sort.Order key)

Finally, sortPairs sorts the PCollection of pairs using the specified column order in Sort.ColumnOrder:

sortPairs(PCollection<Pair<U, V>> collection, Sort.ColumnOrder... columnOrders)
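As a short usage sketch using the second table signature above (the variable names are ours), sorting the hashtag counts from the earlier example by key in descending order could be written as:

// Sort the PTable of hashtag counts by key, descending,
// letting Crunch pick a single reducer by default.
PTable<String, Long> sorted = Sort.sort(counts, Sort.Order.DESCENDING);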

Joining data

The org.apache.crunch.lib.Join package is an API to join PTables based on a common key. The following four join operations are supported:

- fullJoin
- join (defaults to innerJoin)
- leftJoin
- rightJoin

The methods have a common return type and signature. For reference, we will describe the commonly used join method that implements an inner join:

public static <K, U, V> PTable<K, Pair<U, V>> join(PTable<K, U> left, PTable<K, V> right)

The org.apache.crunch.lib.join.JoinStrategy package provides an interface to define custom join strategies. Crunch's default strategy (defaultStrategy) is to join data reduce-side.
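For instance, joining the term-frequency and document-frequency tables of the TF-IDF example later in this chapter is a one-liner:

// Inner join on the term key: each entry pairs a TF record with its document frequency.
PTable<String, Pair<TF, Long>> tfDf = Join.join(tf, df);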


Pipelines implementation and execution

Crunch comes with three implementations of the Pipeline interface. The oldest one, implicitly used in this chapter, is org.apache.crunch.impl.mr.MRPipeline, which uses Hadoop's MapReduce as its execution engine. org.apache.crunch.impl.mem.MemPipeline allows all operations to be performed in memory, with no serialization to disk performed. Crunch 0.10 introduced org.apache.crunch.impl.spark.SparkPipeline, which compiles and runs a DAG of PCollections on Apache Spark.

SparkPipeline

With SparkPipeline, Crunch delegates much of the execution to Spark and does relatively little of the planning tasks, with the following exceptions:

- Multiple inputs
- Multiple outputs
- Data serialization
- Checkpointing

At the time of writing, SparkPipeline is still heavily under development and might not handle all of the use cases of a standard MRPipeline. The Crunch community is actively working to ensure complete compatibility between the two implementations.

MemPipeline

MemPipeline executes in-memory on a client. Unlike MRPipeline, MemPipeline is not explicitly created but referenced by calling the static method MemPipeline.getInstance(). All operations are in memory, and the use of PTypes is minimal.


CrunchexamplesWewillnowuseApacheCrunchtoreimplementsomeoftheMapReducecodewrittensofarinamoremodularfashion.

Wordco-occurrenceInChapter3,Processing–MapReduceandBeyond,weshowedaMapReducejob,BiGramCount,tocountco-occurrencesofwordsintweets.ThatsamelogiccanbeimplementedasaDoFn.Insteadofemittingamulti-fieldkeyandhavingtoparseitatalaterstage,withCrunchwecanuseacomplextypePair<String,String>,asfollows:

class BiGram extends DoFn<String, Pair<String, String>> {
    @Override
    public void process(String tweet,
            Emitter<Pair<String, String>> emitter) {
        String[] words = tweet.split(" ");
        String prev = null;
        for (String s : words) {
            if (prev != null) {
                emitter.emit(Pair.of(prev, s));
            }
            prev = s;
        }
    }
}

Notice how, compared to MapReduce, the BiGram Crunch implementation is a standalone class, easily reusable in any other codebase. The code for this example is included at https://github.com/learninghadoop2/book-examples/blob/master/ch9/crunch/src/main/java/com/learninghadoop2/crunch/DataPreparationPipeline.java
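
To give an idea of how the DoFn slots into a pipeline, counting bigram occurrences might look like the following minimal sketch (the pipeline object and the input/output paths are assumptions for the example):

import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.io.To;
import org.apache.crunch.types.avro.Avros;

// 'pipeline' is assumed to be an existing MRPipeline; args[0]/args[1] are input and output paths.
PCollection<String> tweets = pipeline.readTextFile(args[0]);
PTable<Pair<String, String>, Long> bigramCounts = tweets
        .parallelDo(new BiGram(), Avros.pairs(Avros.strings(), Avros.strings()))
        .count();
bigramCounts.write(To.textFile(args[1]));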

TF-IDF

We can implement the TF-IDF chain of jobs with an MRPipeline, as follows:

public class CrunchTermFrequencyInvertedDocumentFrequency
        extends Configured implements Tool, Serializable {

    private Long numDocs;

    @SuppressWarnings("deprecation")
    public static class TF {
        String term;
        String docId;
        int frequency;

        public TF() {}

        public TF(String term,
                String docId, Integer frequency) {
            this.term = term;
            this.docId = docId;
            this.frequency = (int) frequency;
        }
    }

    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println();
            System.err.println("Usage: " + this.getClass().getName() +
                    " [generic options] input output");
            return 1;
        }

        // Create an object to coordinate pipeline creation and execution.
        Pipeline pipeline =
                new MRPipeline(TermFrequencyInvertedDocumentFrequency.class, getConf());
        // Enable debug options
        pipeline.enableDebug();

        // Reference a given text file as a collection of Strings.
        PCollection<String> tweets = pipeline.readTextFile(args[0]);
        numDocs = tweets.length().getValue();

        // We use Avro reflection to map the TF POJO to an Avro schema (avsc)
        PTable<String, TF> tf = tweets.parallelDo(new TermFrequencyAvro(),
                Avros.tableOf(Avros.strings(), Avros.reflects(TF.class)));

        // Calculate DF
        PTable<String, Long> df = Aggregate.count(
                tf.parallelDo(new DocumentFrequencyString(), Avros.strings()));

        // Finally we calculate TF-IDF
        PTable<String, Pair<TF, Long>> tfDf = Join.join(tf, df);
        PCollection<Tuple3<String, String, Double>> tfIdf =
                tfDf.parallelDo(new TermFrequencyInvertedDocumentFrequency(),
                        Avros.triples(
                                Avros.strings(),
                                Avros.strings(),
                                Avros.doubles()));

        // Serialize as Avro
        tfIdf.write(To.avroFile(args[1]));

        // Execute the pipeline as a MapReduce job.
        PipelineResult result = pipeline.done();
        return result.succeeded() ? 0 : 1;
    }
}


The approach that we follow here has a number of advantages compared to Streaming. First of all, we don't need to manually chain MapReduce jobs using a separate script; this task is Crunch's main purpose. Secondly, we can express each component of the metric as a distinct class, making it easier to reuse in future applications.

To implement term frequency, we create a DoFn class that takes as input a tweet and emits Pair<String, TF>. The first element is a term, and the second is an instance of the POJO class that will be serialized using Avro. The TF part contains three variables: term, docId, and frequency. In the reference implementation, we expect input data to be a JSON string that we deserialize and parse. We also include tokenizing as a subtask of the process method.

Depending on the use case, we could abstract both operations into separate DoFns, as follows:

class TermFrequencyAvro extends DoFn<String, Pair<String, TF>> {
    public void process(String JSONTweet,
            Emitter<Pair<String, TF>> emitter) {
        Map<String, Integer> termCount = new HashMap<>();
        String tweet;
        String docId;
        JSONParser parser = new JSONParser();
        try {
            Object obj = parser.parse(JSONTweet);
            JSONObject jsonObject = (JSONObject) obj;
            tweet = (String) jsonObject.get("text");
            docId = (String) jsonObject.get("id_str");

            for (String term : tweet.split("\\s+")) {
                if (termCount.containsKey(term.toLowerCase())) {
                    termCount.put(term.toLowerCase(),
                            termCount.get(term.toLowerCase()) + 1);
                } else {
                    termCount.put(term.toLowerCase(), 1);
                }
            }

            for (Entry<String, Integer> entry : termCount.entrySet()) {
                emitter.emit(Pair.of(entry.getKey(),
                        new TF(entry.getKey(), docId, entry.getValue())));
            }
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
}

Document frequency is straightforward. For each Pair<String, TF> generated in the term frequency step, we emit the term, the first element of the pair. We aggregate and count the resulting PCollection of terms to obtain document frequency, as follows:

class DocumentFrequencyString extends DoFn<Pair<String, TF>, String> {
    @Override
    public void process(Pair<String, TF> tfAvro,
            Emitter<String> emitter) {
        emitter.emit(tfAvro.first());
    }
}

We finally join the tf PTable with the df PTable on the shared key (the term) and feed the resulting Pair<String, Pair<TF, Long>> objects to TermFrequencyInvertedDocumentFrequency.

For each term and document, we calculate TF-IDF and return a (term, docId, tfIdf) triple:

class TermFrequencyInvertedDocumentFrequency
        extends MapFn<Pair<String, Pair<TF, Long>>, Tuple3<String, String, Double>> {
    @Override
    public Tuple3<String, String, Double> map(
            Pair<String, Pair<TF, Long>> input) {
        Pair<TF, Long> tfDf = input.second();
        Long df = tfDf.second();
        TF tf = tfDf.first();
        double idf = 1.0 + Math.log((double) numDocs / df);
        double tfIdf = idf * tf.frequency;
        return Tuple3.of(tf.term, tf.docId, tfIdf);
    }
}

We use MapFn because we are going to output one record for each input. The source code for this example can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/crunch/src/main/java/com/learninghadoop2/crunch/CrunchTermFrequencyInvertedDocumentFrequency.java

The example can be compiled and executed with the following commands:

$ ./gradlew jar
$ ./gradlew copyJars

If not already done, add the Crunch and Avro dependencies downloaded with copyJars to the LIBJARS environment variable, as follows:

$ export CRUNCH_DEPS=build/libjars/crunch-example/lib
$ export LIBJARS=${LIBJARS},${CRUNCH_DEPS}/crunch-core-0.9.0-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-1.7.5-cdh5.0.3.jar,${CRUNCH_DEPS}/avro-mapred-1.7.5-cdh5.0.3-hadoop2.jar

Furthermore, add the json-simple JAR to LIBJARS:

$ export LIBJARS=${LIBJARS},${CRUNCH_DEPS}/json-simple-1.1.1.jar


Finally, run CrunchTermFrequencyInvertedDocumentFrequency as a MapReduce job, as follows:

$ hadoop jar build/libs/crunch-example.jar \
    com.learninghadoop2.crunch.CrunchTermFrequencyInvertedDocumentFrequency \
    -libjars ${LIBJARS} \
    tweets.json tweets.avro-out


Kite Morphlines

Kite Morphlines is a data transformation library, inspired by Unix pipes, originally developed as part of Cloudera Search. A morphline is an in-memory chain of transformation commands that relies on a plugin structure to tap heterogeneous data sources. It uses declarative commands to carry out ETL operations on records. Commands are defined in a configuration file, which is later fed to a driver class.

The goal is to make embedding ETL logic into any Java codebase a trivial task by providing a library that allows developers to replace programming with a series of configuration settings.

Concepts

Morphlines are built around two abstractions: Command and Record.

Records are instances of the org.kitesdk.morphline.api.Record class:

public final class Record {
    private ArrayListMultimap<String, Object> fields;

    private Record(ArrayListMultimap<String, Object> fields) { … }

    public ListMultimap<String, Object> getFields() { … }
    public List get(String key) { … }
    public void put(String key, Object value) { … }
}

A record is a set of named fields, where each field has a list of one or more values. A Record is implemented on top of Google Guava's ListMultimap and ArrayListMultimap classes. Note that a value can be any Java object, fields can be multivalued, and two records don't need to use common field names. A record can contain an _attachment_body field that can be a java.io.InputStream or a byte array.
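
As a small illustration, and mirroring the driver code shown later in this section, a record carrying a payload to be processed can be built as follows (the JSON string is just an example value):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.kitesdk.morphline.api.Record;
import org.kitesdk.morphline.base.Fields;

// Build a record whose attachment body carries the raw payload for the morphline.
Record record = new Record();
record.put(Fields.ATTACHMENT_BODY,
        new ByteArrayInputStream("{\"text\": \"hello\"}".getBytes(StandardCharsets.UTF_8)));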

Commands implement the org.kitesdk.morphline.api.Command interface:

public interface Command {
    void notify(Record notification);
    boolean process(Record record);
    Command getParent();
}

A command transforms a record into zero or more records. Commands can call the methods on the Record instance provided for read and write operations as well as for adding or removing fields.

Commands are chained together, and at each step of a morphline the parent command sends records to its child, which in turn processes them. Information between parents and children is exchanged using two communication channels (planes): notifications are sent via a control plane, and records are sent over a data plane. Records are processed by the process() method, which returns a Boolean value to indicate whether the morphline should proceed or not.
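
To make the chaining concrete, the following is a minimal, hypothetical command that lowercases every value of a message field and forwards the record to its child. The field name and the way parent and child are injected are assumptions for this sketch; real commands are normally created through a CommandBuilder, described next:

import java.util.List;
import org.kitesdk.morphline.api.Command;
import org.kitesdk.morphline.api.Record;

public class LowercaseMessageCommand implements Command {
    private final Command parent;
    private final Command child;

    public LowercaseMessageCommand(Command parent, Command child) {
        this.parent = parent;
        this.child = child;
    }

    @Override
    public void notify(Record notification) {
        // Forward lifecycle notifications down the control plane.
        child.notify(notification);
    }

    @Override
    public boolean process(Record record) {
        // Lowercase every value of the "message" field (an assumed field name).
        List values = record.get("message");
        for (int i = 0; i < values.size(); i++) {
            values.set(i, values.get(i).toString().toLowerCase());
        }
        // Send the transformed record down the data plane.
        return child.process(record);
    }

    @Override
    public Command getParent() {
        return parent;
    }
}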


Commands are not instantiated directly, but via an implementation of the org.kitesdk.morphline.api.CommandBuilder interface:

public interface CommandBuilder {
    Collection<String> getNames();
    Command build(Config config,
                  Command parent,
                  Command child,
                  MorphlineContext context);
}

The getNames method returns the names with which the command can be invoked. Multiple names are supported to allow backwards-compatible name changes. The build() method creates and returns a command rooted at the given morphline configuration.

The org.kitesdk.morphline.api.MorphlineContext class allows additional parameters to be passed to all morphline commands.

The data model of morphlines is structured following a source-pipe-sink pattern, where data is captured from a source, piped through a number of processing steps, and its output is then delivered into a sink.

Morphline commands

Kite Morphlines comes with a number of default commands that implement data transformations on common serialization formats (plain text, Avro, JSON). Currently available commands are organized as subprojects of morphlines and include:

kite-morphlines-core-stdio: reads data from binary large objects (BLOBs) and text
kite-morphlines-core-stdlib: wraps Java data types for data manipulation and representation
kite-morphlines-avro: serializes data into, and deserializes data from, the Avro format
kite-morphlines-json: serializes and deserializes data in JSON format
kite-morphlines-hadoop-core: is used to access HDFS
kite-morphlines-hadoop-parquet-avro: serializes and deserializes data in the Parquet format
kite-morphlines-hadoop-sequencefile: serializes and deserializes data in the SequenceFile format
kite-morphlines-hadoop-rcfile: serializes and deserializes data in the RCFile format

A list of all available commands can be found at http://kitesdk.org/docs/0.17.0/kite-morphlines/morphlinesReferenceGuide.html.

Commands are defined by declaring a chain of transformations in a configuration file, morphline.conf, which is then compiled and executed by a driver program. For instance, we can specify a read_tweets morphline that will load tweets stored as JSON data, serialize and deserialize them using Jackson, and print the first 10, by combining the default readJson and head commands contained in the org.kitesdk.morphline package, as follows:

morphlines : [{
  id : read_tweets
  importCommands : ["org.kitesdk.morphline.**"]
  commands : [{
      readJson {
        outputClass : com.fasterxml.jackson.databind.JsonNode
      }}
    {
      head {
        limit : 10
      }}
  ]
}]

We will now show how this morphline can be executed both from a standalone Java program as well as from MapReduce.

MorphlineDriver.java shows how to use the library embedded into a host system. The first step that we carry out in the main method is to load the morphline's JSON configuration, build a MorphlineContext object, and compile it into an instance of Command that acts as the starting node of the morphline. Note that Compiler.compile() takes a finalChild parameter; in this case, it is RecordEmitter. We use RecordEmitter to act as a sink for the morphline, by either printing a record to stdout or storing it into HDFS. In the MorphlineDriver example, we use org.kitesdk.morphline.base.Notifications to manage and monitor the morphline lifecycle in a transactional fashion.

A call to Notifications.notifyStartSession(morphline) starts the transformation chain within a transaction defined by calling Notifications.notifyBeginTransaction. Upon success, we terminate the pipeline with Notifications.notifyShutdown(morphline). In the event of failure, we roll back the transaction with Notifications.notifyRollbackTransaction(morphline) and pass an exception handler from the morphline context to the calling Java code:

public class MorphlineDriver {

    private static final class RecordEmitter implements Command {
        private final Text line = new Text();

        @Override
        public Command getParent() {
            return null;
        }

        @Override
        public void notify(Record record) {
        }

        @Override
        public boolean process(Record record) {
            line.set(record.get("_attachment_body").toString());
            System.out.println(line);
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        /* Load a morphline conf and set it up */
        File morphlineFile = new File(args[0]);
        String morphlineId = args[1];
        MorphlineContext morphlineContext = new MorphlineContext.Builder().build();
        Command morphline = new Compiler().compile(morphlineFile,
                morphlineId, morphlineContext, new RecordEmitter());

        /* Prepare the morphline for execution.
         *
         * Notifications are sent through the communication channel.
         */
        Notifications.notifyBeginTransaction(morphline);

        /* Note that we are using the local filesystem, not HDFS */
        InputStream in = new BufferedInputStream(new FileInputStream(args[2]));

        /* Fill in a record and pass it over */
        Record record = new Record();
        record.put(Fields.ATTACHMENT_BODY, in);
        try {
            Notifications.notifyStartSession(morphline);
            boolean success = morphline.process(record);
            if (!success) {
                System.out.println("Morphline failed to process record: " + record);
            }
            /* Commit the morphline */
        } catch (RuntimeException e) {
            Notifications.notifyRollbackTransaction(morphline);
            morphlineContext.getExceptionHandler().handleException(e, null);
        } finally {
            in.close();
        }

        /* Shut it down */
        Notifications.notifyShutdown(morphline);
    }
}

In this example, we load data in JSON format from the local filesystem into an InputStream object and use it to initialize a new Record instance. The RecordEmitter class contains the last processed record instance of the chain, from which we extract _attachment_body and print it to standard output. The source code for MorphlineDriver can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/kite/src/main/java/com/learninghadoop2/kite/morphlines/MorphlineDriver.java

Using the same morphline from a MapReduce job is straightforward. During the setup phase of the Mapper, we build a context that contains the instantiation logic, while the map method sets the Record object up and fires off the processing logic, as follows:

public static class ReadTweets
        extends Mapper<Object, Text, Text, NullWritable> {

    private final Record record = new Record();
    private Command morphline;

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        File morphlineConf = new File(context.getConfiguration()
                .get(MORPHLINE_CONF));
        String morphlineId = context.getConfiguration()
                .get(MORPHLINE_ID);
        MorphlineContext morphlineContext =
                new MorphlineContext.Builder()
                        .build();
        morphline = new org.kitesdk.morphline.base.Compiler()
                .compile(morphlineConf,
                        morphlineId,
                        morphlineContext,
                        new RecordEmitter(context));
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        record.put(Fields.ATTACHMENT_BODY,
                new ByteArrayInputStream(
                        value.toString().getBytes("UTF8")));
        if (!morphline.process(record)) {
            System.out.println(
                    "Morphline failed to process record: " + record);
        }
        record.removeAll(Fields.ATTACHMENT_BODY);
    }
}

In the MapReduce code, we modify RecordEmitter to extract the Fields payload from post-processed records and store it into the context. This allows us to write data into HDFS by specifying a FileOutputFormat in the MapReduce configuration boilerplate:

private static final class RecordEmitter implements Command {
    private final Text line = new Text();
    private final Mapper.Context context;

    private RecordEmitter(Mapper.Context context) {
        this.context = context;
    }

    @Override
    public void notify(Record notification) {
    }

    @Override
    public Command getParent() {
        return null;
    }

    @Override
    public boolean process(Record record) {
        line.set(record.get(Fields.ATTACHMENT_BODY).toString());
        try {
            context.write(line, null);
        } catch (Exception e) {
            e.printStackTrace();
            return false;
        }
        return true;
    }
}

Notice that we can now change the processing pipeline behavior and add further data transformations by modifying morphline.conf, without the explicit need to alter the instantiation and processing logic. The MapReduce driver source code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/kite/src/main/java/com/learninghadoop2/kite/morphlines/MorphlineDriverMapReduce.java

Both examples can be compiled from ch9/kite/ with the following commands:

$ ./gradlew jar
$ ./gradlew copyJar

We add the runtime dependencies to LIBJARS, as follows:

$ export KITE_DEPS=build/libjars/kite-example/lib
$ export LIBJARS=${LIBJARS},${KITE_DEPS}/kite-morphlines-core-0.17.0.jar,${KITE_DEPS}/kite-morphlines-json-0.17.0.jar,${KITE_DEPS}/metrics-core-3.0.2.jar,${KITE_DEPS}/metrics-healthchecks-3.0.2.jar,${KITE_DEPS}/config-1.0.2.jar,${KITE_DEPS}/jackson-databind-2.3.1.jar,${KITE_DEPS}/jackson-core-2.3.1.jar,${KITE_DEPS}/jackson-annotations-2.3.0.jar

We can run the MapReduce driver with the following:

$ hadoop jar build/libs/kite-example.jar \
    com.learninghadoop2.kite.morphlines.MorphlineDriverMapReduce \
    -libjars ${LIBJARS} \
    morphline.conf \
    read_tweets \
    tweets.json \
    morphlines-out


The Java standalone driver can be executed with the following commands:

$ export CLASSPATH=${CLASSPATH}:${KITE_DEPS}/kite-morphlines-core-0.17.0.jar:${KITE_DEPS}/kite-morphlines-json-0.17.0.jar:${KITE_DEPS}/metrics-core-3.0.2.jar:${KITE_DEPS}/metrics-healthchecks-3.0.2.jar:${KITE_DEPS}/config-1.0.2.jar:${KITE_DEPS}/jackson-databind-2.3.1.jar:${KITE_DEPS}/jackson-core-2.3.1.jar:${KITE_DEPS}/jackson-annotations-2.3.0.jar:${KITE_DEPS}/slf4j-api-1.7.5.jar:${KITE_DEPS}/guava-11.0.2.jar:${KITE_DEPS}/hadoop-common-2.3.0-cdh5.0.3.jar

$ java -cp $CLASSPATH:./build/libs/kite-example.jar \
    com.learninghadoop2.kite.morphlines.MorphlineDriver \
    morphline.conf \
    read_tweets tweets.json \
    morphlines-out


Summary

In this chapter, we introduced four tools to ease development on Hadoop. In particular, we covered:

How Hadoop Streaming allows the writing of MapReduce jobs using dynamic languages
How Kite Data simplifies interfacing with heterogeneous data sources
How Apache Crunch provides a high-level abstraction to write pipelines of Spark and MapReduce jobs that implement common design patterns
How Morphlines allows us to declare chains of commands and data transformations that can then be embedded in any Java codebase

In Chapter 10, Running a Hadoop 2 Cluster, we will shift our focus from the domain of software development to system administration. We will discuss how to set up, manage, and scale a Hadoop cluster, while taking aspects such as monitoring and security into consideration.


Chapter 10. Running a Hadoop Cluster

In this chapter, we will change our focus a little and look at some of the considerations you will face when running an operational Hadoop cluster. In particular, we will cover the following topics:

Why a developer should care about operations and why Hadoop operations are different
More detail on Cloudera Manager and its capabilities and limitations
Designing a cluster for use on both physical hardware and EMR
Securing a Hadoop cluster
Hadoop monitoring
Troubleshooting problems with an application running on Hadoop


I'm a developer – I don't care about operations!

Before going any further, we need to explain why we are putting a chapter about systems operations in a book squarely aimed at developers. For anyone who has developed for more traditional platforms (for example, web apps, database programming, and so on), the norm might well have been a very clear delineation between development and operations: the first group builds the code and packages it up, and the second group controls and operates the environment in which it runs.

In recent years, the DevOps movement has gained momentum with the belief that it is best for everyone if these silos are removed and the teams work more closely together. When it comes to running systems and services based on Hadoop, we believe this is absolutely essential.


Hadoop and DevOps practices

Even though a developer can conceptually build an application ready to be dropped into YARN and forgotten about, the reality is often more nuanced. How many resources are allocated to the application at runtime is most likely something the developer wishes to influence. Once the application is running, the operations staff likely want some insight into the application when they are trying to optimize the cluster. There really isn't the same clear-cut split of responsibilities seen in traditional enterprise IT. And that's likely a really good thing.

In other words, developers need to be more aware of the operations aspects, and the operations staff need to be more aware of what the developers are doing. So consider this chapter our contribution to helping you have those discussions with your operations staff. We don't intend to make you an expert Hadoop administrator by the end of this chapter; that really is emerging as a dedicated role and skillset in itself. Instead, we will give a whistle-stop tour of issues you do need some awareness of and that will make your life easier once your applications are running on live clusters.

By the nature of this coverage, we will be touching on a lot of topics and going into them only lightly; if any are of deeper interest, then we provide links for further investigation. Just make sure you keep your operations staff involved!


Cloudera Manager

In this book, the platform we have used most commonly is the Cloudera Hadoop Distribution (CDH), with its convenient QuickStart virtual machine and the powerful Cloudera Manager application. With a Cloudera-based cluster, Cloudera Manager will become (at least initially) your primary interface into the system to manage and monitor the cluster, so let's explore it a little.

Note that Cloudera Manager has extensive and high-quality online documentation. We won't duplicate this documentation here; instead, we'll attempt to highlight where Cloudera Manager fits into your development and operational workflows and how it might or might not be something you want to embrace. Documentation for the latest and previous versions of Cloudera Manager can be accessed via the main Cloudera documentation page at http://www.cloudera.com/content/support/en/documentation.html.


To pay or not to pay

Before getting all excited about Cloudera Manager, it's important to consult the current documentation concerning what features are available in the free version and which ones require subscription to a paid-for Cloudera offering. If you absolutely want some of the features offered only in the paid-for version but either can't or don't wish to pay for subscription services, then Cloudera Manager, and possibly the entire Cloudera distribution, might not be a good fit for you. We'll return to this topic in Chapter 11, Where to Go Next.


Cluster management using Cloudera Manager

Using the QuickStart VM, it won't be obvious, but Cloudera Manager is the primary tool to be used for management of all services in the cluster. If you want to enable a new service, you'll use Cloudera Manager. To change a configuration, you will need Cloudera Manager. To upgrade to the latest release, you will again require Cloudera Manager.

Even if the primary management of the cluster is handled by operational staff, as a developer you'll likely still want to become familiar with the Cloudera Manager interface just to see exactly how the cluster is configured. If your jobs are running slowly, then looking into Cloudera Manager to see just how things are currently configured will likely be your first step. The default port for the Cloudera Manager web interface is 7180, so the home page will usually be connected to via a URL such as http://<hostname>:7180/cmf/home, and can be seen in the following screenshot:

Cloudera Manager home page

It's worth poking around the interface; however, if you are connecting with a user account with admin privileges, be careful!

Click on the Clusters link, and this will expand to give a list of the clusters currently managed by this instance of Cloudera Manager. This should tell you that a single Cloudera Manager instance can manage multiple clusters. This is very useful, especially if you have many clusters spread across development and production.

For each expanded cluster, there will be a list of the services currently running on the cluster. Click on a service, and then you will see a list of additional choices. Select Configuration, and you can start browsing the detailed configuration of that particular service. Click on Actions, and you will get some service-specific options; this will usually include stopping, starting, restarting, and otherwise managing the service.


Click on the Hosts option instead of Clusters, and you can start drilling down into the servers managed by Cloudera Manager, and from there, see which service components are deployed on each.

Cloudera Manager and other management tools

That last comment might raise a question: how does Cloudera Manager integrate with other systems management tools? Given our earlier comments regarding the importance of DevOps philosophies, how well does it integrate with the tools favored in DevOps environments?

The honest answer: not always very well. Though the main Cloudera Manager server can itself be managed by automation tools, such as Puppet or Chef, there is an explicit assumption that Cloudera Manager will control the installation and configuration of all the software it needs on all the hosts that will be included in its clusters. To some administrators, this makes the hardware behind Cloudera Manager look like a big black box; they might control the installation of the base operating system, but the management of the configuration baseline going forward is entirely handled by Cloudera Manager. There's nothing much to be done here; it is what it is: to get the benefits of Cloudera Manager, it will add itself as a new management system in your infrastructure, and how well that fits in with your broader environment will be determined on a case-by-case basis.


Monitoring with Cloudera Manager

A similar point can be made regarding systems monitoring, as Cloudera Manager is also conceptually a point of duplication here. But start clicking around the interface, and it will become apparent very quickly that Cloudera Manager provides an exceptionally rich set of tools to assess the health and performance of managed clusters.

From graphing the relative performance of Impala queries, through showing the job status for YARN applications, to giving low-level data on the blocks stored on HDFS, it is all there in a single interface. We'll discuss later in this chapter how troubleshooting on Hadoop can be challenging, but the single point of visibility provided by Cloudera Manager is a great tool when looking to assess cluster health or performance. We'll discuss monitoring in a little more detail later in this chapter.

Finding configuration files

One of the first confusions faced when running a cluster managed by Cloudera Manager is trying to find the configuration files used by the cluster. In the vanilla Apache releases of products, these files would typically be stored in /etc/hadoop for core Hadoop, /etc/hive for Hive, /etc/oozie for Oozie, and so on.

In a Cloudera Manager-managed cluster, however, the config files are regenerated each time a service is restarted, and instead of sitting in the /etc locations on the filesystem, they will be found under /var/run/cloudera-scm-agent/process/<pid>-<taskname>/, where the last directory might have a name such as 7007-yarn-NODEMANAGER. This might seem odd to anyone used to working on earlier Hadoop clusters or other distributions that don't do such a thing. But in a Cloudera Manager-controlled cluster, it is often easier to use the web interface to browse the configuration instead of looking for the underlying config files. Which approach is best? This is a little philosophical, and each team needs to decide which works best for them.


Cloudera Manager API

We've only given the highest-level overview of Cloudera Manager and, in doing so, have completely ignored one area that might be very useful for some organizations: Cloudera Manager offers an API that allows integration of its capabilities into other systems and tools. Consult the documentation if this might be of interest to you.


Cloudera Manager lock-in

This brings us to the point that is implicit in the whole discussion around Cloudera Manager: it does cause a degree of lock-in to Cloudera and their distribution. That lock-in might only exist in certain ways; code, for example, should be portable across clusters, modulo the usual caveats about different underlying versions, but the cluster itself might not easily be reconfigured to use a different distribution. Assume that switching distributions would be a complete remove/reformat/reinstall activity.

We aren't saying don't use it, rather that you need to be aware of the lock-in that comes with the use of Cloudera Manager. For small teams with little dedicated operations support or existing infrastructure, the impact of such a lock-in is likely outweighed by the significant capabilities that Cloudera Manager gives you.

For larger teams, or ones working in an environment where integration with existing tools and processes has more weight, the decision might be less clear. Look at Cloudera Manager, discuss with your operations people, and determine what is right for you.

Note that it is possible to manually download and install the various components of the Cloudera distribution without using Cloudera Manager to manage the cluster and its hosts. This might be an attractive middle ground for some users, as the Cloudera software can be used, but deployment and management can be built into the existing deployment and management tools. This is also potentially a way of avoiding the additional expense of the paid-for levels of Cloudera support mentioned earlier.


Ambari – the open source alternative

Ambari is an Apache project (http://ambari.apache.org) which, in theory, provides an open source alternative to Cloudera Manager. It is the administration console for the Hortonworks distribution. At the time of writing, Hortonworks employees also make up the vast majority of the project contributors.

Ambari, as one would expect given its open source nature, relies on other open source products, such as Puppet and Nagios, to provide the management and monitoring of its managed clusters. It also has high-level functionality similar to Cloudera Manager, that is, the installation, configuration, management, and monitoring of a Hadoop cluster and the component services within it.

It is good to be aware of the Ambari project, as the choice is not just between full lock-in to Cloudera and Cloudera Manager or a manually managed cluster. Ambari provides a graphical tool that might be worth consideration, or indeed involvement, as it matures. On an HDP cluster, the Ambari UI equivalent to the Cloudera Manager home page shown earlier can be reached at http://<hostname>:8080/#/main/dashboard and looks like the following screenshot:

Ambari


Operations in the Hadoop 2 world

As mentioned in Chapter 2, Storage, some of the most significant changes made to HDFS in Hadoop 2 involve its fault tolerance and better integration with external systems. This is not just a curiosity; the NameNode High Availability features, in particular, have made a massive difference in the management of clusters since Hadoop 1. In the bad old days of 2012 or so, a significant part of the operational preparedness of a Hadoop cluster was built around mitigations for, and restoration processes around, failure of the NameNode. If the NameNode died in Hadoop 1 and you didn't have a backup of the HDFS fsimage metadata file, then you basically lost access to all your data. If the metadata was permanently lost, then so was the data.

Hadoop 2 has added built-in NameNode HA and the machinery to make it work. In addition, there are components such as the NFS gateway into HDFS, which make it a much more flexible system. But this additional capability does come at the expense of more moving parts. To enable NameNode HA, there are additional components in the JournalManager and Failover Controller, and the NFS gateway requires Hadoop-specific implementations of the portmap and nfsd services.

Hadoop 2 also now has extensive other integration points with external services, as well as a much broader selection of applications and services that run atop it. Consequently, it might be useful to view Hadoop 2, in terms of operations, as having traded the simplicity of Hadoop 1 for additional complexity, which delivers a substantially more capable platform.


Sharing resources

In Hadoop 1, the only time one had to consider resource sharing was in deciding which scheduler to use for the MapReduce JobTracker. Since all jobs were eventually translated into MapReduce code, having a policy for resource sharing at the MapReduce level was usually sufficient to manage cluster workloads in the large.

Hadoop 2 and YARN changed this picture. As well as running many MapReduce jobs, a cluster might also be running many other applications atop other YARN ApplicationMasters. Tez and Spark are frameworks in their own right that run additional applications atop their provided interfaces.

If everything runs on YARN, then it provides ways of configuring the maximum resource allocation (in terms of CPU, memory, I/O, and so on) consumed by each container allocated to an application. The primary goal here is to ensure that enough resources are allocated to keep the hardware fully utilized, without either having unused capacity or overloading it.

Things get somewhat more interesting when non-YARN applications, such as Impala, are running on the cluster and want to grab allocated slices of capacity (particularly memory in the case of Impala). This could also happen if, say, you were running Spark on the same hosts in its non-YARN mode, or indeed any other distributed application that might benefit from co-location on the Hadoop machines.

Basically, in Hadoop 2, you need to think of the cluster as much more of a multi-tenancy environment, which requires more attention given to the allocation of resources to the various tenants.

There really is no silver-bullet recommendation here; the right configuration will be entirely dependent on the services co-located and the workloads they are running. This is another example where you want to work closely with your operations team to do a series of load tests with thresholds to determine just what the resource requirements of the various clients are and which approach will give the maximum utilization and performance. The following blog post from Cloudera engineers gives a good overview of how they approach this very issue in having Impala and MapReduce coexist effectively: http://blog.cloudera.com/blog/2013/06/configuring-impala-and-mapreduce-for-multi-tenant-performance/.


Building a physical cluster

There is one minor requirement before thinking about allocation of hardware resources: defining and selecting the hardware used for your cluster. In this section, we'll discuss a physical cluster and move on to Amazon EMR in the next.

Any specific hardware advice will be out of date the moment it is written. We advise perusing the websites of the various Hadoop distribution vendors, as they regularly write new articles on the currently recommended configurations.

Instead of telling you how many cores or GB of memory you need, we'll look at hardware selection at a slightly higher level. The first thing to realize is that the hosts running your Hadoop cluster will most likely look very different from the rest of your enterprise. Hadoop is optimized for low(er)-cost hardware, so instead of seeing a small number of very large servers, expect to see a larger number of machines with fewer enterprise reliability features. But don't think that Hadoop will run great on any junk you have lying around. It might, but recently the profile of typical Hadoop servers has been moving away from the bottom end of the market, and instead, the sweet spot would seem to be in mid-range servers where the maximum cores/disks/memory can be achieved at a price point.

You should also expect to have different resource requirements for the hosts running services such as the HDFS NameNode or the YARN ResourceManager, as opposed to the worker nodes storing data and executing the application logic. For the former, there is usually much less requirement for lots of storage, but frequently a need for more memory and possibly faster disks.

For Hadoop worker nodes, the ratio between the three main hardware categories of cores, memory, and I/O is often the most important thing to get right. And this will directly inform the decisions you make regarding workload and resource allocation.

For example, many workloads tend to become I/O bound, and having many times as many containers allocated on a host as there are physical disks might actually cause an overall slowdown due to contention for the spinning disks. At the time of writing, current recommendations here are for the number of YARN containers to be no more than 1.8 times the number of disks. If you have workloads that are I/O bound, then you will most likely get much better performance by adding more hosts to the cluster instead of trying to get more containers running, or indeed faster processors or more memory, on the current hosts.

Conversely, if you expect to run lots of concurrent Impala, Spark, and other memory-hungry jobs, then memory might quickly become the resource most under pressure. This is why, even though you can get current hardware recommendations for general-purpose clusters from the distribution vendors, you still need to validate against your expected workloads and tailor accordingly. There is really no substitute for benchmarking on a small test cluster, or indeed on EMR, which can be a great platform to explore the resource requirements of multiple applications and inform hardware acquisition decisions. Perhaps EMR might be your main environment; if so, we'll discuss that in a later section.


Physical layout

If you do use a physical cluster, there are a few things you will need to consider that are largely transparent on EMR.

Rack awareness

The first of these aspects, for clusters large enough to consume more than one rack of data center space, is building rack awareness. As mentioned in Chapter 2, Storage, when HDFS places replicas of new files, it attempts to place the second replica on a different host than the first, and the third in a different rack of equipment in a multi-rack system. This heuristic is aimed at maximizing resilience; there will be at least one replica available even if an entire rack of equipment fails. MapReduce uses similar logic to attempt to get a better-balanced task spread.

If you do nothing, then each host will be specified as being in the single default rack. But if the cluster grows beyond this point, you will need to update the rack names.

Under the covers, Hadoop discovers a node's rack by executing a user-supplied script that maps node hostnames to rack names. Cloudera Manager allows rack names to be set on a given host, and these are then retrieved when its rack awareness scripts are called by Hadoop. To set the rack for a host from the Cloudera Manager home page, click on Hosts -> <hostname> -> Assign Rack, and then assign the rack.
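
As an alternative to a shell script, Hadoop can also be given a Java implementation of the hostname-to-rack mapping. The following is a minimal sketch of the org.apache.hadoop.net.DNSToSwitchMapping interface; the naming rule (hosts called worker-r1-05 live in rack /r1) and the wiring via the net.topology.node.switch.mapping.impl property are assumptions for the example and should be checked against your Hadoop version:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

public class SimpleRackMapping implements DNSToSwitchMapping {

    @Override
    public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<String>();
        for (String name : names) {
            String[] parts = name.split("-");
            if (parts.length >= 3) {
                racks.add("/" + parts[1]);   // e.g. worker-r1-05 -> /r1
            } else {
                racks.add("/default-rack");
            }
        }
        return racks;
    }

    // No caching is done in this sketch, so there is nothing to reload.
    public void reloadCachedMappings() {
    }

    public void reloadCachedMappings(List<String> names) {
    }
}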

Service layout

As mentioned earlier, you are likely to have two types of hardware in your cluster: the machines running the workers and those running the servers. When deploying a physical cluster, you will need to decide which services, and which subcomponents of the services, run on which physical machines.

For the workers, this is usually pretty straightforward; most, though not all, services have a model of a worker agent on all worker hosts. But for the master/server components, it requires a little thought. If you have three master nodes, then how do you spread your primary and backup NameNodes, the YARN ResourceManager, maybe Hue, a few Hive servers, and an Oozie manager? Some of these services are highly available, while others are not. As you add more and more services to your cluster, you'll also see this list of master services grow substantially.

In an ideal world, you might have a host per service master, but that is only tractable for very large clusters; in smaller installations it is prohibitively expensive, and it might always be a little wasteful. There are no hard-and-fast rules here either, but do look at your available hardware, and try to spread the services across the nodes as much as possible. Don't, for example, have two nodes for the two NameNodes and then put everything else on a third. Think about the impact of a single host failure and manage the layout to minimize it. As the cluster grows across multiple racks of equipment, the considerations will also need to include how to survive single-rack failures. Hadoop itself helps with this, since HDFS will attempt to ensure each block of data has replicas across at least two racks. But this type of resilience is undermined if, for example, all the master nodes reside in a single rack.

Upgrading a service

Upgrading Hadoop has historically been a time-consuming and somewhat risky task. This remains the case on a manually deployed cluster, that is, one not managed by a tool such as Cloudera Manager.

If you are using Cloudera Manager, then it takes the time-consuming part out of the activity, but not necessarily the risk. Any upgrade should always be viewed as an activity with a high chance of unexpected issues, and you should arrange enough cluster downtime to account for this surprise excitement. There's really no substitute for doing a test upgrade on a test cluster, which underlines the importance of thinking about Hadoop as a component of your environment that needs to be treated with a deployment lifecycle like any other.

Sometimes an upgrade requires modification to the HDFS metadata or might otherwise affect the filesystem. This is, of course, where the real risks lie. In addition to running a test upgrade, be aware of the ability to put HDFS in upgrade mode, which effectively makes a snapshot of the filesystem state prior to the upgrade and which will be retained until the upgrade is finalized. This can be really helpful, as even an upgrade that goes badly wrong and corrupts data can potentially be fully rolled back.


Building a cluster on EMR

Elastic MapReduce is a flexible solution that, depending on requirements and workloads, can sit next to, or replace, a physical Hadoop cluster. As we've seen so far, EMR provides clusters preloaded and configured with Hive, Streaming, and Pig, as well as custom JAR clusters that allow the execution of MapReduce applications.

A second distinction to make is between transient and long-running lifecycles. A transient EMR cluster is generated on demand; data is loaded in S3 or HDFS, some processing workflow is executed, output results are stored, and the cluster is automatically shut down. A long-running cluster is kept alive once the workflow terminates, and the cluster remains available for new data to be copied over and new workflows to be executed. Long-running clusters are typically well suited for data warehousing or for working with datasets large enough that loading and processing data would be inefficient compared to a transient instance.

In a must-read whitepaper for prospective users (found at https://media.amazonwebservices.com/AWS_Amazon_EMR_Best_Practices.pdf), Amazon gives a heuristic to estimate which cluster type is a better fit, as follows:

If number of jobs per day * (time to set up cluster, including Amazon S3 data load time if using Amazon S3, + data processing time) < 24 hours, consider transient Amazon EMR clusters or physical instances.

Long-running instances are instantiated by passing the --alive argument to the elastic-mapreduce command, which enables the KeepAlive option and disables autotermination.

Note that transient and long-running clusters share the same properties and limitations; in particular, data on HDFS is not persisted once the cluster is shut down.


Considerations about filesystems

In our examples so far, we assumed data to be available in S3. In this case, a bucket is mounted in EMR as an s3n filesystem, and it is used as the input source as well as a temporary filesystem to store intermediate data in computations. With S3 we introduce potential I/O overhead, as operations such as reads and writes fire off GET and PUT HTTP requests.

Note that EMR does not support S3 block storage; the s3 URI maps to s3n.

Another option would be to load data into the cluster's HDFS and run processing from there. In this case, we do have faster I/O and data locality, but we would lose persistence: when the cluster is shut down, our data disappears. As a rule of thumb, if you are running a transient cluster, it makes sense to use S3 as a backend. In practice, one should monitor and take decisions based on the workflow characteristics. Iterative, multi-pass MapReduce jobs would greatly benefit from HDFS; one could argue that for those types of workflows, an execution engine like Tez or Spark would be more appropriate.


Getting data into EMR

When copying data from HDFS to S3, it is recommended to use s3distcp (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html) instead of Apache DistCp or Hadoop distcp. This approach is also suitable for transferring data within EMR and from S3 to HDFS. To move very large amounts of data from local disk into S3, Amazon recommends parallelizing the workload using Jets3t or GNU Parallel. In general, it's important to be aware that PUT requests to S3 are capped at 5 GB per file. To upload larger files, one needs to rely on Multipart Upload (https://aws.amazon.com/about-aws/whats-new/2010/11/10/Amazon-S3-Introducing-Multipart-Upload/), an API that allows splitting large files into smaller parts and reassembling them once uploaded. Files can also be copied with tools such as the AWS CLI or the popular s3cmd utility, but these do not have the parallelism advantages of s3distcp.


EC2 instances and tuning

The size of an EMR cluster depends on the dataset size, the number of files and blocks (which determines the number of splits), and the type of workload (try to avoid spilling to disk when a task runs out of memory). As a rule of thumb, a good size is one that maximizes parallelism. The number of mappers and reducers per instance, as well as the heap size per JVM daemon, is generally configured by EMR when the cluster is provisioned and tuned in the event of changes in the available resources.


Cluster tuning

In addition to the previous comments specific to a cluster run on EMR, there are some general considerations to keep in mind when running workloads on any type of cluster. These will, of course, be more explicit when running outside of EMR, as EMR often abstracts away some of the details.


JVM considerations

You should be running the 64-bit version of a JVM and using server mode. This can take longer to produce optimized code, but it also uses more aggressive strategies and will re-optimize code over time. This makes it a much better fit for long-running services, such as Hadoop processes.

Ensure that you allocate enough memory to the JVM to prevent overly frequent Garbage Collection (GC) pauses. The concurrent mark-and-sweep collector is currently the most tested and recommended for Hadoop. The Garbage First (G1) collector has become the GC option of choice in numerous other workloads since its introduction with JDK 7, so it's worth monitoring recommended best practice as it evolves. These options can be configured as custom Java arguments within each service's configuration section of Cloudera Manager.

The small files problem

Heap allocation to Java processes on worker nodes will be something you consider when thinking about service co-location. But there is a particular situation regarding the NameNode you should be aware of: the small files problem.

Hadoop is optimized for very large files with large block sizes. But sometimes particular workloads or data sources push many small files onto HDFS. This is most likely suboptimal, as it means each task processing a block at a time will read only a small amount of data before completing, causing inefficiency.

Having many small files also consumes more NameNode memory; the NameNode holds in memory the mapping from files to blocks and consequently holds metadata for each file and block. If the number of files, and hence blocks, increases quickly, then so will the NameNode memory usage. This is likely to only hit a subset of systems as, at the time of writing, 1 GB of memory can support 2 million files or blocks, but with a default heap size of 2 or 4 GB, this limit can easily be reached. If the NameNode needs to start running garbage collection very aggressively, or eventually runs out of memory, then your cluster will be very unhealthy. The mitigation is to assign more heap to the JVM; the longer-term approach is to combine many small files into a smaller number of larger ones, ideally compressed with a splittable compression codec.
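
One common way to do this consolidation is to pack the small files into a single SequenceFile keyed by the original filename. The following is a minimal sketch of the idea; the paths, the lack of compression, and the choice of SequenceFile rather than, say, Avro containers or HAR files are assumptions for the example:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);   // directory full of small files
        Path packed = new Path(args[1]);     // single output SequenceFile

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(packed),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (!status.isFile()) {
                    continue;
                }
                byte[] content = new byte[(int) status.getLen()];
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    in.readFully(content);
                } finally {
                    in.close();
                }
                // One record per original file: filename -> raw bytes.
                writer.append(new Text(status.getPath().getName()),
                        new BytesWritable(content));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}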


Map and reduce optimizations

Mappers and reducers both provide areas for optimizing performance; here are a few pointers to consider:

The number of mappers depends on the number of splits. When files are smaller than the default block size or compressed using a non-splittable format, the number of mappers will equal the number of files. Otherwise, the number of mappers is given by the total size of each file divided by the block size.
Compress mapper output to reduce writes to disk and improve I/O; LZO is a good format for this task (see the configuration sketch after this list).
Avoid spill to disk: the mappers should have enough memory to retain as much data as possible.
Number of reducers: it is recommended that you use fewer reducers than the total reducer capacity (this avoids execution waits).
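
As an illustration of the compression pointer above, map output compression is controlled from the job configuration. The following minimal sketch uses the Hadoop 2 property names; Snappy is shown because it ships with Hadoop, whereas LZO usually needs the codec installed separately on the cluster (an assumption worth verifying in your environment):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

// Enable compression of intermediate map output (Hadoop 2 property names).
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);

Job job = Job.getInstance(conf, "compressed-map-output");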


Security

Once you built a cluster, the first thing you thought about was how to secure it, right? Don't worry, most people don't. But as Hadoop has moved on from being something running in-house analysis in the research department to directly driving critical systems, it's not something to ignore for too long.

Securing Hadoop is not something to be done on a whim or without significant testing. We cannot give detailed advice on this topic, and cannot stress strongly enough the need to take it seriously and do it properly. It might consume time, it might cost money, but weigh this against the cost of having your cluster compromised.

Security is also a much bigger topic than just the Hadoop cluster. We'll explore some of the security features available in Hadoop, but you do need a coherent security strategy into which these discrete components fit.


Evolution of the Hadoop security model

In Hadoop 1, there was effectively no security protection, as the provided security model had obvious attack vectors. The Unix user ID with which you connected to the cluster was assumed to be valid, and you had all the privileges of that user. Plainly, this meant that anyone with administrative access on a host that could access the cluster could effectively impersonate any other user.

This led to the development of the so-called "head node" access model, whereby the Hadoop cluster was firewalled off from every host except one, the head node, and all access to the cluster was mediated through this centrally controlled node. This was an effective mitigation for the lack of a real security model and can still be useful even when richer security schemes are utilized.


Beyond basic authorization

Core Hadoop has had additional security features added, which address the previous concerns. In particular, they address the following:

A cluster can require a user to authenticate via Kerberos and prove they are who they say they are. In secure mode, the cluster can also use Kerberos for all node-to-node communications, ensuring that all communicating nodes are authenticated and preventing malicious nodes from attempting to join the cluster.
To ease management, users can be collected into groups against which data-access privileges can be defined. This is called Role Based Access Control (RBAC) and is a prerequisite for a secure cluster with more than a handful of users. The user-group mappings can be retrieved from corporate systems, such as LDAP or Active Directory.
HDFS can apply ACLs to replace the current Unix-inspired owner/group/world model.

These capabilities give Hadoop a significantly stronger security posture than in the past, but the community is moving fast, and additional dedicated Apache projects have emerged to address specific areas of security.

Apache Sentry (https://sentry.incubator.apache.org) is a system to provide much finer-grained authorization to Hadoop data and services. Other services build Sentry mappings, and this allows, for example, specific restrictions to be placed not only on particular HDFS directories, but also on entities such as Hive tables.

Whereas Sentry focuses on providing much richer tools for the internal, fine-grained aspects of Hadoop security, Apache Knox (http://knox.apache.org) provides a secure gateway to Hadoop that integrates with external identity management systems and provides access control mechanisms to allow or disallow access to specific Hadoop services and operations. It does this by presenting a REST-only interface to Hadoop and securing all calls to this API.


The future of Hadoop security

There are many other developments happening in the Hadoop world. Core Hadoop 2.5 added extended file attributes to HDFS, which can be used as the basis of additional access control mechanisms. Future versions will incorporate capabilities for better support of encryption for data in transit as well as at rest, and the Project Rhino initiative led by Intel (https://github.com/intel-hadoop/project-rhino/) is building out richer support for filesystem cryptographic modules, a secure filesystem, and, at some point, a fuller key-management infrastructure.

The Hadoop distribution vendors are moving fast to add these capabilities to their releases, so if you care about security (you do, don't you!), then consult the documentation for the latest release of your distribution. New security features are being added even in point updates and aren't being delayed until major upgrades.


Consequences of using a secured cluster

After teasing you with all the security goodness that is now available and that which is coming, it's only fair to give some words of warning. Security is often hard to do correctly, and often the false sense of security created by a buggy deployment is worse than knowing you have no security.

However, even if you do it right, there are consequences to running a secure cluster. It makes things harder for the administrators, certainly, and often the users, so there is definitely an overhead. Specific Hadoop tools and services will also work differently depending on what security is employed on a cluster.

Oozie, which we discussed in Chapter 8, Data Lifecycle Management, uses its own delegation tokens behind the scenes. This allows the oozie user to submit jobs that are then executed on behalf of the originally submitting user. In a cluster using only the basic authorization mechanism, this is very easily configured, but using Oozie in a secure cluster will require additional logic to be added to the workflow definitions and the general Oozie configuration. This isn't a problem with Hadoop or Oozie; however, just as with the additional complexity resulting from the much better HA features of HDFS in Hadoop 2, better security mechanisms will simply have costs and consequences that you need to take into consideration.


Monitoring

Earlier in this chapter, we discussed Cloudera Manager as a visual monitoring tool and hinted that it could also be programmatically integrated with other monitoring systems. But before plugging Hadoop into any monitoring framework, it's worth considering just what it means to operationally monitor a Hadoop cluster.


Hadoop – where failures don't matter

Traditional systems monitoring tends to be quite a binary tool; generally speaking, either something is working or it isn't. A host is alive or dead, and a web server is responding or it isn't. But in the Hadoop world, things are a little different; the important thing is service availability, and a service can still be treated as live even if particular pieces of hardware or software have failed. No Hadoop cluster should be in trouble if a single worker node fails. As of Hadoop 2, even the failure of the server processes, such as the NameNode, shouldn't really be a concern if HA is configured. So, any monitoring of Hadoop needs to take into account the service health and not that of specific host machines, which should be unimportant. Operations people on a 24/7 pager are not going to be happy getting paged at 3 a.m. to discover that one worker node in a cluster of 10,000 has failed. Indeed, once the scale of the cluster increases beyond a certain point, the failure of individual pieces of hardware becomes an almost commonplace occurrence.


MonitoringintegrationYouwon’tbebuildingyourownmonitoringtools;instead,youmightlikelywanttointegratewithexistingtoolsandframeworks.Forpopularopensourcemonitoringtools,suchasNagiosandZabbix,therearemultiplesampletemplatestointegrateHadoop’sservice-wideandnode-specificmetrics.

Thiscangivethesortofseparationhintedpreviously;thefailureoftheYARNResourceManagerwouldbeahigh-criticalityeventthatshouldmostlikelycausealertstobesenttooperationsstaff,butahighloadonspecifichostsshouldonlybecapturedandnotcausealertstobefired.Thisthenprovidesthedualityoffiringalertswhenbadthingshappeninadditiontocapturingandprovidingtheinformationneededtodelveintosystemdataovertimetodotrendanalysis.

ClouderaManagerprovidesaRESTinterface,whichisanotherpointofintegrationagainstwhichtoolssuchasNagioscanintegrateandpulltheClouderaManager-definedservice-levelmetricsinsteadofhavingtodefineitsown.
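As a minimal sketch of that integration point, a monitoring script could poll the Cloudera Manager API over HTTP; the host, credentials, and API version shown here are assumptions and will vary by installation:

$ curl -u admin:admin 'http://cmhost:7180/api/v6/clusters'
$ curl -u admin:admin 'http://cmhost:7180/api/v6/clusters/cluster1/services'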

For heavier-weight enterprise-monitoring infrastructure built on frameworks such as IBM Tivoli or HP OpenView, Cloudera Manager can also deliver events via SNMP traps that will be collected by these systems.


Application-level metrics

At times, you might also want your applications to gather metrics that can be centrally captured within the system. The mechanisms for this will differ from one computational model to another, but the most well known are the application counters available within MapReduce.

When a MapReduce job completes, it outputs a number of counters, gathered by the system throughout the job execution, that deal with metrics such as the number of map tasks, bytes written, failed tasks, and so on. You can also write application-specific metrics that will be available alongside the system counters and which are automatically aggregated across the map/reduce execution. First define a Java enum, and name your desired metrics within it, as follows:

public enum AppMetrics {
    MAX_SEEN,
    MIN_SEEN,
    BAD_RECORDS
};

Then, within the map, reduce, setup, and cleanup methods of your Map or Reduce implementations, you can do something like the following to increment a counter by one:

context.getCounter(AppMetrics.BAD_RECORDS).increment(1);

Refer to the JavaDoc of the org.apache.hadoop.mapreduce.Counter interface for more details of this mechanism.
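Putting the two pieces together, the following is a minimal sketch of a mapper that counts malformed input records via a custom counter. The class name, the assumption of tab-separated records, and the field layout are all illustrative rather than taken from the book's examples:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: emits (first field, 1) and counts records with too few fields
public class MetricsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    public enum AppMetrics { MAX_SEEN, MIN_SEEN, BAD_RECORDS }

    private final IntWritable one = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length < 2) {
            // The counter is aggregated across all map tasks and reported with the job
            context.getCounter(AppMetrics.BAD_RECORDS).increment(1);
            return;
        }
        outKey.set(fields[0]);
        context.write(outKey, one);
    }
}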


Troubleshooting

Monitoring and logging counters or additional information is all well and good, but it can be intimidating to know how to actually find the information you need when troubleshooting a problem with an application. In this section, we will look at how Hadoop stores logs and system information. We can distinguish three types of logs, as follows:

- YARN applications, including MapReduce jobs
- Daemon logs (NameNode and ResourceManager)
- Services that log non-distributed workloads, for example, HiveServer2 logging to /var/log

Next to these log types, Hadoop exposes a number of metrics at the filesystem level (storage availability, replication factor, and number of blocks) and at the system level. As mentioned, both Apache Ambari and Cloudera Manager, which centralize access to debug information, do a nice job as the frontend. However, under the hood, each service logs to either HDFS or the single-node filesystem. Furthermore, YARN, MapReduce, and HDFS expose their log files and metrics via web interfaces and programmatic APIs.
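For example, the ResourceManager and the NameNode both expose machine-readable metrics over HTTP, which scripts or monitoring agents can scrape. The hostnames below are placeholders, and these are the default ports:

$ curl http://resourcemanagerhost:8088/ws/v1/cluster/metrics
$ curl http://namenodehost:50070/jmx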


Logging levels

Hadoop logs messages through Log4j by default. Log4j is configured via log4j.properties in the classpath. This file defines what is logged and with which layout:

log4j.rootLogger=${root.logger}
root.logger=INFO,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n

The default root logger is INFO,console, which logs all messages at level INFO and above to the console's stderr. Single applications deployed on Hadoop can ship their own log4j.properties and set the level and other properties of their emitted logs as required.
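As an illustration, an application could bundle a log4j.properties of the following shape on its classpath to raise the verbosity of its own classes while leaving everything else at INFO; the package name is hypothetical:

log4j.rootLogger=INFO,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
# Raise verbosity only for the application's own (hypothetical) package
log4j.logger.com.example.analytics=DEBUG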

Hadoop daemons have a web page to get and set the log level for any Log4j property. This interface is exposed by the /logLevel endpoint in each service web UI. To enable debug logging for the ResourceManager class, we will visit http://resourcemanagerhost:8088/logLevel, as shown in the following screenshot:

Getting and setting the log level on the ResourceManager

Alternatively, the hadoop daemonlog <host:port> command interfaces with the service's /logLevel endpoint. We can inspect the level associated with mapreduce.map.log.level on the ResourceManager using the -getlevel <property> parameter, as follows:

$ hadoop daemonlog -getlevel localhost.localdomain:8088 mapreduce.map.log.level
Connecting to http://localhost.localdomain:8088/logLevel?log=mapreduce.map.log.level
Submitted Log Name: mapreduce.map.log.level
Log Class: org.apache.commons.logging.impl.Log4JLogger
Effective level: INFO

The effective level can be modified using the -setlevel <property> <level> option:

$ hadoop daemonlog -setlevel localhost.localdomain:8088 mapreduce.map.log.level DEBUG
Connecting to http://localhost.localdomain:8088/logLevel?log=mapreduce.map.log.level&level=DEBUG
Submitted Log Name: mapreduce.map.log.level
Log Class: org.apache.commons.logging.impl.Log4JLogger
Submitted Level: DEBUG
Setting Level to DEBUG ...
Effective level: DEBUG

Note that this setting will affect all logs produced by the ResourceManager daemon. This includes system-generated entries as well as the ones generated by applications running on YARN.


Access to log files

Log file locations and naming conventions are likely to differ based on the distribution. Apache Ambari and Cloudera Manager centralize access to log files, both for services and single applications. On Cloudera's QuickStart VM, an overview of the currently running processes, with links to their log files and the stderr and stdout channels, can be found at http://localhost.localdomain:7180/cmf/hardware/hosts/1/processes, as shown in the following screenshot:

Access to log resources in Cloudera Manager

Ambari provides a similar overview via the Services dashboard, found at http://127.0.0.1:8080/#/main/services on the HDP Sandbox, as shown in the following screenshot:


Access to log resources on Apache Ambari

Non-distributed logs are usually found under /var/log/<service> on each cluster node. The location of YARN container and MRv2 logs also depends on the distribution. On CDH5, these resources are available in HDFS under /tmp/logs/<user>.

The standard way to access distributed logs is either via command-line tools or using the services' web UIs.

For instance, the command is as follows:

$ yarn application -list -appStates ALL

The preceding command will list all running and retired YARN applications. The URL in the Tracking-URL column points to a web interface that exposes the task log, as follows:

14/08/03 14:44:38 INFO client.RMProxy: Connecting to ResourceManager at localhost.localdomain/127.0.0.1:8032
Total number of applications (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED]): 4
Application-Id                  Application-Name          Application-Type  User      Queue          State     Final-State  Progress  Tracking-URL
application_1405630696162_0002  PigLatin:DefaultJobName   MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0002
application_1405630696162_0004  PigLatin:DefaultJobName   MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0004
application_1405630696162_0003  PigLatin:DefaultJobName   MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0003
application_1405630696162_0005  PigLatin:DefaultJobName   MAPREDUCE         cloudera  root.cloudera  FINISHED  SUCCEEDED    100%      http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0005

For instance, http://localhost.localdomain:19888/jobhistory/job/job_1405630696162_0002, a link to a job belonging to the user cloudera, is a frontend to the content stored under hdfs:///tmp/logs/cloudera/logs/application_1405630696162_0002/.
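When log aggregation is enabled (yarn.log-aggregation-enable set to true, which is the case here given that the logs appear under /tmp/logs in HDFS), the same aggregated logs can also be pulled from the command line instead of the web UI, for example:

$ yarn logs -applicationId application_1405630696162_0002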

In the following sections, we will give an overview of the available UIs for the different services.

Note: Provisioning an EMR cluster with the --log-uri s3://<bucket> option will ensure that the Hadoop logs are copied into the s3://<bucket> location.


ResourceManager, NodeManager, and ApplicationManager

On YARN, the ResourceManager web UI provides information and general job statistics of the Hadoop cluster, running/completed/failed jobs, and a job history log file. By default, the UI is exposed at http://<resourcemanagerhost>:8088/ and can be seen in the following screenshot:

ResourceManager

Applications

On the left-hand sidebar, it is possible to review the application statuses of interest: NEW, SUBMITTED, ACCEPTED, RUNNING, FINISHING, FINISHED, FAILED, or KILLED. Depending on the application status, the following information is available:

- The application ID
- The submitting user
- The application name
- The scheduler queue in which the application is placed
- Start/finish times and state
- A link to the Tracking UI for application history

In addition, the Cluster Metrics view gives you information on the following:

- Overall application status
- Number of running containers
- Memory usage
- Node status

Nodes

The Nodes view is a frontend to the NodeManager service menu, which shows health and location information on the node's running applications, as follows:


Nodes status

Each individual node of the cluster exposes further information and statistics at the host level via its own UI. These include which version of Hadoop is running on the node, how much memory is available on the node, the node status, and a list of running applications and containers, as shown in the following screenshot:

Single node info

Scheduler

The following screenshot shows the Scheduler window:

Scheduler

MapReduce

Though the same information and logging details are available in MapReduce v1 and MapReduce v2, the access modality is slightly different.

MapReduce v1

The following screenshot shows the MapReduce JobTracker UI:

The JobTracker UI

The JobTracker UI, available by default at http://<jobtracker>:50030, exposes information on all currently running as well as retired MapReduce jobs, a summary of the cluster resources and health, as well as scheduling information and completion percentage, as shown in the following screenshot:

Job details


For each running and retired job, details are available, including its ID, owner, priority, task assignment, and task launch for the mapper. Clicking on a job ID link will lead to a job details page (the same URL exposed by the mapred job -list command). This resource gives details about both the map and reduce tasks as well as general counter statistics at the job, filesystem, and MapReduce levels; these include the memory used, the number of read/write operations, and the number of bytes read and written.

For each map and reduce operation, the JobTracker exposes the total, pending, running, completed, and failed tasks, as shown in the following screenshot:

Job tasks overview

Clicking on the links in the Job table will lead to a further overview at the task and task-attempt levels, as shown in the following screenshot:

Task attempts

From this last page, we can access the logs of each task attempt, both for successful and failed/killed tasks, on each individual TaskTracker host. This log contains the most granular information about the status of the MapReduce job, including the output of Log4j appenders as well as output piped to the stdout and stderr channels and syslog, as shown in the following screenshot:


TaskTracker logs

MapReduce v2 (YARN)

As we have seen in Chapter 3, Processing – MapReduce and Beyond, with YARN, MapReduce is only one of many processing frameworks that can be deployed. Recall from previous chapters that the JobTracker and TaskTracker services have been replaced by the ResourceManager and NodeManager, respectively. As such, both the service UIs and the log files from YARN are more generic than MapReduce v1.

The application_1405630696162_0002 name shown in the ResourceManager corresponds to a MapReduce job with the job_1405630696162_0002 ID. That application ID belongs to the task running inside the container, and clicking on it will reveal an overview of the MapReduce job and allow a drill-down to the individual tasks from either phase until the single-task log is reached, as shown in the following screenshot:


A YARN application containing a MapReduce job

JobHistory Server

YARN ships with a JobHistory REST service that exposes details on finished applications. Currently, it only supports MapReduce and provides information on finished jobs. This includes the job's final status (SUCCESSFUL or FAILED), who submitted the job, the total number of map and reduce tasks, and timing information.
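The same data can be retrieved programmatically; as a sketch, the REST endpoints below (assumed default paths and port for the MapReduce history server) return job summaries and the counters for an individual job in JSON:

$ curl http://jobhistoryhost:19888/ws/v1/history/mapreduce/jobs
$ curl http://jobhistoryhost:19888/ws/v1/history/mapreduce/jobs/job_1405630696162_0002/counters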

A UI is available at http://<jobhistoryhost>:19888/jobhistory, as shown in the following screenshot:

JobHistory UI

Clicking on each job ID will lead to the MapReduce job UI shown in the YARN application screenshot.


NameNode and DataNode

The web interface for the Hadoop Distributed File System (HDFS) shows information about the NameNode itself as well as the filesystem generally.

By default, it is located at http://<namenodehost>:50070/, as shown in the following screenshot:

NameNode UI

The Overview menu exposes NameNode information about DFS capacity and usage and the block pool status, and it gives a summary of the status of DataNode health and availability. The information contained in this page is for the most part equivalent to what is shown by the following command:

$ hdfs dfsadmin -report

The DataNodes menu gives more detailed information about the status of each node and offers a drill-down at the single-host level, both for available and decommissioned nodes, as shown in the following screenshot:


DataNode UI


Summary

This has been quite a whistle-stop tour around the considerations of running an operational Hadoop cluster. We didn't try to turn developers into administrators, but hopefully, the broader perspective will help you to help your operations staff. In particular, we covered the following topics:

- How Hadoop is a natural fit for DevOps approaches, as its multilayered complexity means it's not possible or desirable to have substantial knowledge gaps between development and operations staff
- Cloudera Manager, and how it can be a great management and monitoring tool; it might cause integration problems, though, if you have other enterprise tools, and it comes with a vendor lock-in risk
- Ambari, the Apache open source alternative to Cloudera Manager, and how it is used in the Hortonworks distribution
- How to think about selecting hardware for a physical Hadoop cluster, and how this naturally fits into the considerations of how the multiple workloads possible in the world of Hadoop 2 can peacefully coexist on shared resources
- The different considerations for firing up and using EMR clusters, and how this can be both an adjunct to, as well as an alternative to, a physical cluster
- The Hadoop security ecosystem, how it is a very fast-moving area, and how the features available today are vastly better than some years ago, with still more around the corner
- Monitoring of a Hadoop cluster, considering what events are important in the Hadoop model of embracing failure, and how these alerts and metrics can be integrated into other enterprise-monitoring frameworks
- How to troubleshoot issues with a Hadoop cluster, both in terms of what might have happened and how to find the information to inform your analysis
- A quick tour of the various web UIs provided by Hadoop, which can give very good overviews of happenings within various components in the system

This concludes our treatment of Hadoop in depth. In the final chapter, we will express some thoughts on the broader Hadoop ecosystem, give some pointers to useful and interesting tools and products that we didn't have a chance to cover in the book, and suggest how to get involved with the community.


Chapter 11. Where to Go Next

In the previous chapters, we have examined many parts of Hadoop 2 and the ecosystem around it. However, we have necessarily been limited by page count; some areas we didn't get into in as much depth as was possible, other areas we referred to only in passing or did not mention at all.

The Hadoop ecosystem, with its distributions and Apache and non-Apache projects, is an incredibly vibrant and healthy place to be right now. In this chapter, we hope to complement the previously discussed, more detailed material with a travel guide, if you will, for other interesting destinations. In this chapter, we will discuss the following topics:

- Hadoop distributions
- Other significant Apache and non-Apache projects
- Sources of information and help

Of course, note that any overview of the ecosystem is both skewed by our interests and preferences, and outdated the moment it is written. In other words, don't for a moment think this is all that's available; consider it instead a whetting of the appetite.


AlternativedistributionsWe’vegenerallyusedtheClouderadistributionforHadoopinthisbook,buthaveattemptedtokeepthecoveragedistributionindependentasmuchaspossible.We’vealsomentionedtheHortonworksDataPlatform(HDP)throughoutthisbookbutthesearecertainlynottheonlydistributionchoicesavailabletoyou.

Beforetakingalookaround,let’sconsiderwhetheryouneedadistributionatall.ItiscompletelypossibletogototheApachewebsite,downloadthesourcetarballsoftheprojectsinwhichyouareinterested,thenworktobuildthemalltogether.However,givenversiondependencies,thisislikelytoconsumemoretimethanyouwouldexpect.Potentially,vastlymoreso.Inaddition,theendproductwilllikelylacksomepolishintermsoftoolsorscriptsforoperationaldeploymentandmanagement.Formostusers,theseareasarewhyemployinganexistingHadoopdistributionisthenaturalchoice.

Anoteonfreeandcommercialextensions—beinganopensourceprojectwithaquiteliberallicense,distributioncreatorsarealsofreetoenhanceHadoopwithproprietaryextensionsthataremadeavailableeitherasfreeopensourceorcommercialproducts.

Thiscanbeacontroversialissueassomeopensourceadvocatesdislikeanycommercializationofsuccessfulopensourceprojects;tothem,itappearsthatthecommercialentityisfreeloadingbytakingthefruitsoftheopensourcecommunitywithouthavingtobuilditforthemselves.OthersseethisasahealthyaspectoftheflexibleApachelicense;thebaseproductwillalwaysbefree,andindividualsandcompaniescanchoosewhethertogowithcommercialextensionsornot.Wedon’tgivejudgmenteitherway,butbeawarethatthisisanotherofthecontroversiesyouwillalmostcertainlyencounter.

Soyouneedtodecideifyouneedadistributionandifsoforwhatreasons,whichspecificaspectswillbenefityoumostaboverollingyourown?Doyouwishforafullyopensourceproductorareyouwillingtopayforcommercialextensions?Withthesequestionsinmind,let’slookatafewofthemaindistributions.


Cloudera Distribution for Hadoop

You will be familiar with the Cloudera distribution (http://www.cloudera.com), as it has been used throughout this book. CDH was the first widely available alternative distribution, and its breadth of available software, proven level of quality, and free cost have made it a very popular choice.

Recently, Cloudera has been actively extending the products it adds to its distribution beyond the core Hadoop projects. In addition to Cloudera Manager and Impala (both Cloudera-developed products), it has also added other tools such as Cloudera Search (based on Apache Solr) and Cloudera Navigator (a data governance solution). While CDH versions prior to 5 were focused more on the integration benefits of a distribution, version 5 (and presumably beyond) is adding more and more capability atop the base Apache Hadoop projects.

Cloudera also offers commercial support for its products, in addition to training and consultancy services. Details can be found on the company web page.


Hortonworks Data Platform

In 2011, the Yahoo! division responsible for so much of the development of Hadoop was spun off into a new company called Hortonworks. They have also produced their own pre-integrated Hadoop distribution, called the Hortonworks Data Platform (HDP), available at http://hortonworks.com/products/hortonworksdataplatform/.

HDP is conceptually similar to CDH, but both products have differences in their focus. Hortonworks makes much of the fact that HDP is fully open source, including the management tool Ambari, which we discussed briefly in Chapter 10, Running a Hadoop Cluster. They have also positioned HDP as a key integration platform through its support for tools such as Talend Open Studio. Hortonworks does not offer proprietary software; its business model focuses instead on offering professional services and support for the platform.

Both Cloudera and Hortonworks are venture-backed companies with significant engineering expertise; both companies employ many of the most prolific contributors to Hadoop. The underlying technology is, however, comprised of the same Apache projects; the distinguishing factors are how they are packaged, the versions employed, and the additional value-added offerings provided by the companies.


MapR

A different type of distribution is offered by MapR Technologies, although the company and distribution are usually referred to simply as MapR. The distribution, available from http://www.mapr.com, is based on Hadoop but has added a number of changes and enhancements.

The focus of the MapR distribution is on performance and availability. For example, it was the first distribution to offer a high-availability solution for the Hadoop NameNode and JobTracker, which, as you will remember from Chapter 2, Storage, was a significant weakness in core Hadoop 1. It also offered native integration with NFS filesystems long before Hadoop 2, which makes processing of existing data much easier. To achieve these features, MapR replaced HDFS with a fully POSIX-compliant filesystem that has no NameNode, resulting in a truly distributed system with no master and a claim of much better hardware utilization than Apache HDFS.

MapR provides both a community and an enterprise edition of its distribution; not all the extensions are available in the free product. The company also offers support services as part of the enterprise product subscription, in addition to training and consultancy.


And the rest…

Hadoop distributions are not just the territory of young start-ups, nor are they a static marketplace. Intel had its own distribution until early 2014, when it decided to fold its changes into CDH instead. IBM has its own distribution called IBM InfoSphere BigInsights, available in both free and commercial editions. There are also various stories of numerous large enterprises rolling their own distributions, some of which are made openly available while others are not. You will have no shortage of options with so many high-quality distributions available.


Choosing a distribution

This raises the question: how do you choose a distribution? As can be seen, the available distributions (and we didn't cover them all) range from convenient packaging and integration of fully open source products through to entirely bespoke integration and analysis layers atop them. There is no overall best distribution; think carefully about your requirements and consider the alternatives. Since all of these offer a free download of at least a basic version, it's good to simply play with and experience the options for yourself.


OthercomputationalframeworksWe’vefrequentlydiscussedthemyriadpossibilitiesbroughttotheHadoopplatformbyYARN.Wewentintodetailsoftwonewmodels,SamzaandSpark.Additionally,othermoreestablishedframeworkssuchasPigarealsobeingportedtotheframework.

Togiveaviewofthemuchbiggerpictureinthissection,wewillillustratethebreadthofprocessingpossibleusingYARNbypresentingasetofcomputationalmodelsthatarecurrentlybeingportedtoHadoopontopofYARN.


Apache Storm

Storm (http://storm.apache.org) is a distributed computation framework written (mainly) in the Clojure programming language. It uses custom-created spouts and bolts to define information sources and manipulations, allowing the distributed processing of streaming data. A Storm application is designed as a topology of interfaces that creates a stream of transformations. It provides similar functionality to a MapReduce job, with the exception that the topology will theoretically run indefinitely until it is manually terminated.

Though initially built distinct from Hadoop, a YARN port is being developed by Yahoo! and can be found at https://github.com/yahoo/storm-yarn.


Apache Giraph

Giraph originated as the open source implementation of Google's Pregel paper (which can be found at http://kowshik.github.io/JPregel/pregel_paper.pdf). Both Giraph and Pregel are inspired by the Bulk Synchronous Parallel (BSP) model of distributed computation introduced by Valiant in 1990. Giraph adds several features, including master computation, sharded aggregators, edge-oriented input, and out-of-core computation. The YARN port can be found at https://issues.apache.org/jira/browse/GIRAPH-13.


Apache HAMA

Hama is a top-level Apache project that aims, like other methods we've encountered so far, to address the weakness of MapReduce with regard to iterative programming. Similar to the aforementioned Giraph, Hama implements the BSP techniques and has been heavily inspired by the Pregel paper. The YARN port can be found at https://issues.apache.org/jira/browse/HAMA-431.


Other interesting projects

Whether you use a bundled distribution or stick with the base Apache Hadoop download, you will encounter many references to other related projects. We've covered several of these, such as Hive, Samza, and Crunch, in this book; we'll now highlight some of the others.

Note that this coverage seeks to point out the highlights (from the authors' perspective) as well as give a taste of the breadth of the types of projects available. As mentioned earlier, keep looking out, as there will be new ones launching all the time.


HBase

Perhaps the most popular Apache Hadoop-related project that we didn't cover in this book is HBase (http://hbase.apache.org). Based on the BigTable model of data storage publicized by Google in an academic paper (sound familiar?), HBase is a non-relational data store sitting atop HDFS.

While both MapReduce and Hive focus on batch-like data access patterns, HBase instead seeks to provide very low-latency access to data. Consequently, HBase can, unlike the aforementioned technologies, directly support user-facing services.

The HBase data model is not the relational approach that is used in Hive and all other RDBMSs, nor does it offer the full ACID guarantees that are taken for granted with relational stores. Instead, it is a key-value, schema-less solution that takes a column-oriented view of data; columns can be added at runtime and depend on the values inserted into HBase. Each lookup operation is then very fast, as it is effectively a key-value mapping from the row key to the desired column. HBase also treats timestamps as another dimension on the data, so one can directly retrieve data from a point in time.

The data model is very powerful but does not suit all use cases, just as the relational model isn't universally applicable. But if you have a requirement for structured, low-latency views on large-scale data stored in Hadoop, then HBase is absolutely something you should look at.
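To make the data model concrete, the following minimal sketch uses the HBase Java client to write and read a single cell. The table name, column family, and row key are hypothetical, and the table is assumed to already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Assumes a pre-created 'tweets' table with a 'd' column family
        HTable table = new HTable(conf, "tweets");

        // Write one cell: row key -> column family:qualifier -> value
        Put put = new Put(Bytes.toBytes("user1#20140803"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("text"), Bytes.toBytes("hello hbase"));
        table.put(put);

        // Read it back by row key; the lookup is a direct key-value access
        Result result = table.get(new Get(Bytes.toBytes("user1#20140803")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("d"), Bytes.toBytes("text"))));

        table.close();
    }
}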


Sqoop

In Chapter 7, Hadoop and SQL, we looked at tools for presenting a relational-like interface to data stored on HDFS. Often, such data either needs to be retrieved from an existing relational database, or the output of its processing needs to be stored back into one.

Apache Sqoop (http://sqoop.apache.org) provides a mechanism for declaratively specifying data movement between relational databases and Hadoop. It takes a task definition and from this generates MapReduce jobs to execute the required data retrieval or storage. It will also generate code to help manipulate relational records with custom Java classes. In addition, it can integrate with HBase and HCatalog/Hive, and it provides a very rich set of integration possibilities.

At the time of writing, Sqoop is slightly in flux. Its original version, Sqoop 1, was a pure client-side application. Much like the original Hive command-line tool, Sqoop 1 has no server and generates all code on the client. This unfortunately means that each client needs to know a lot of details about the physical data sources, including exact hostnames as well as authentication credentials.

Sqoop 2 provides a centralized Sqoop server that encapsulates all these details and offers the various configured data sources to the connecting clients. It is a superior model, but at the time of writing, the general community recommendation is to stick with Sqoop 1 until the new version evolves further. Check on the current status if you are interested in this type of tool.
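As an indicative example of the declarative style, an import from a relational table into HDFS can be expressed as a single command; the connection string, credentials, table, and target directory here are purely illustrative:

$ sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username reporting -P \
    --table orders \
    --target-dir /data/raw/orders \
    --num-mappers 4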


Whirr

When looking to use cloud services such as Amazon AWS for Hadoop deployments, it is usually a lot easier to use a higher-level service such as Elastic MapReduce as opposed to setting up your own cluster on EC2. Though there are scripts to help, the fact is that the overhead of Hadoop-based deployments on cloud infrastructures can be involved. That's where Apache Whirr (https://whirr.apache.org/) comes in.

Whirr isn't focused on Hadoop; it's about supplier-independent instantiation of cloud services, of which Hadoop is a single example. Whirr aims to provide a programmatic way of specifying and creating Hadoop-based deployments on cloud infrastructures in a way that handles all the underlying service aspects for you. It does this in a provider-independent fashion, so that once you've launched on, say, EC2, you can use the same code to create the identical setup on another provider such as Rightscale or Eucalyptus. This makes vendor lock-in, often a concern with cloud deployments, less of an issue.

Whirr isn't quite there yet. Today, it is limited in the services it can create and the providers it supports; however, if you are interested in cloud deployment with less pain, then it's worth watching its progress.

Note: If you are building out your full infrastructure on Amazon Web Services, then you might find that CloudFormation gives much of the same ability to define application requirements, though obviously in an AWS-specific fashion.


Mahout

Apache Mahout (http://mahout.apache.org/) is a collection of distributed algorithms, Java classes, and tools for performing advanced analytics on top of Hadoop. Similar to Spark's MLlib, briefly mentioned in Chapter 5, Iterative Computation with Spark, Mahout ships with a number of algorithms for common use cases: recommendation, clustering, regression, and feature engineering. Although the system is focused on natural language processing and text-mining tasks, its building blocks (linear algebra operations) are suitable to be applied to a number of domains. As of version 0.9, the project is being decoupled from the MapReduce framework in favor of richer programming models such as Spark. The community's end goal is to obtain a platform-independent library based on a Scala DSL.


Hue

Initially developed by Cloudera and marketed as the "User Interface for Hadoop", Hue (http://gethue.com/) is a collection of applications, bundled together under a common web interface, that act as clients for core services and a number of components of the Hadoop ecosystem:

The Hue Query Editor for Hive

Hue leverages many of the tools we discussed in previous chapters and provides an integrated interface for analyzing and visualizing data. Two components are particularly interesting. On the one hand, there is a query editor that allows the user to create and save Hive (or Impala) queries, export the result set in CSV or Microsoft Office Excel format, as well as plot it in the browser. The editor features the capability of sharing both HiveQL and result sets, thus facilitating collaboration within an organization. On the other hand, there is an Oozie workflow and coordinator editor that allows a user to create and deploy Oozie jobs manually, automating the generation of XML configurations and boilerplate.

Both the Cloudera and Hortonworks distributions ship with Hue and typically include the following:

- A file manager for HDFS
- A Job Browser for YARN (MapReduce)
- An Apache HBase browser
- A Hive metastore explorer
- Query editors for Hive and Impala
- A script editor for Pig
- A job editor for MapReduce and Spark
- An editor for Sqoop 2 jobs
- An Oozie workflow editor and dashboard
- An Apache ZooKeeper browser

On top of this, Hue is a framework with an SDK that contains a number of web assets, APIs, and patterns for developing third-party applications that interact with Hadoop.


OtherprogrammingabstractionsHadoopisn’tjustextendedbyadditionalfunctionality,therearetoolstoprovideentirelydifferentparadigmsforwritingthecodeusedtoprocessyourdatawithinHadoop.


Cascading

Developed by Concurrent and open sourced under an Apache license, Cascading (http://www.cascading.org/) is a popular framework that abstracts away the complexity of MapReduce and allows us to create complex workflows on top of Hadoop. Cascading jobs can compile to, and be executed on, MapReduce, Tez, and Spark. Conceptually, the framework is similar to Apache Crunch, covered in Chapter 9, Making Development Easier, though practically there are differences in terms of data abstractions and end goals. Cascading adopts a tuple data model (similar to Pig) rather than arbitrary objects, and it encourages the user to rely on a higher-level DSL, powerful built-in types, and tools to manipulate data.

Put in simple terms, Cascading is to Pig Latin and HiveQL what Crunch is to a user-defined function.

Like Morphlines, which we also saw in Chapter 9, Making Development Easier, the Cascading data model follows a source-pipe-sink approach, where data is captured from a source, piped through a number of processing steps, and its output is then delivered into a sink, ready to be picked up by another application.

Cascading encourages developers to write code in a number of JVM languages. Ports of the framework exist for Python (PyCascading), JRuby (Cascading.jruby), Clojure (Cascalog), and Scala (Scalding). Cascalog and Scalding in particular have gained a lot of traction and spawned off their very own ecosystems.

An area where Cascading excels is documentation. The project provides comprehensive javadocs of the API, extensive tutorials (http://www.cascading.org/documentation/tutorials/), and an interactive, exercise-based learning environment (https://github.com/Cascading/Impatient).

Another strong selling point of Cascading is its integration with third-party environments. Amazon EMR supports Cascading as a first-class processing framework and allows us to launch Cascading clusters both with the command line and web interfaces (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/CreateCascading.html). Plugins for the SDK exist for both the IntelliJ IDEA and Eclipse integrated development environments. One of the framework's top projects, Cascading Patterns, a collection of machine-learning algorithms, features a utility for translating Predictive Model Markup Language (PMML) documents into applications on Apache Hadoop, thus facilitating interoperability with popular statistical environments and scientific tools such as R (http://cran.r-project.org/web/packages/pmml/index.html).


AWS resources

Many Hadoop technologies can be deployed on AWS as part of a self-managed cluster. However, just as Amazon offers Elastic MapReduce, which provides Hadoop as a managed service, there are a few other services that are worth mentioning.


SimpleDB and DynamoDB

For some time, AWS has offered SimpleDB as a hosted service providing an HBase-like data model.

It has, however, largely been superseded by a more recent service from AWS, DynamoDB, located at http://aws.amazon.com/dynamodb. Though its data model is very similar to that of SimpleDB and HBase, it is aimed at a very different type of application. Where SimpleDB has quite a rich search API but is very limited in terms of size, DynamoDB provides a more constrained, though constantly evolving, API, but with a service guarantee of near-unlimited scalability.

The DynamoDB pricing model is particularly interesting; instead of paying for a certain number of servers hosting the service, you allocate a certain capacity for read and write operations, and DynamoDB manages the resources required to meet this provisioned capacity. This is an interesting development, as it is a purer service model, where the mechanism of delivering the desired performance is kept completely opaque to the service user. Have a look at DynamoDB if you need a much larger scale of data store than SimpleDB can offer; however, do consider the pricing model carefully, as provisioning too much capacity can become very expensive very quickly. Amazon provides some good best practices for DynamoDB at the following URL, which illustrate that minimizing the service costs can result in additional application-layer complexity: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BestPractices.html.

Note: Of course, the discussion of DynamoDB and SimpleDB assumes a non-relational data model; there is the Amazon Relational Database Service (Amazon RDS) for a relational database in the cloud.


Kinesis

Just as EMR is hosted Hadoop and DynamoDB has similarities to a hosted HBase, it wasn't surprising to see AWS announce Kinesis, a hosted streaming data service, in 2013. It can be found at http://aws.amazon.com/kinesis, and it has very similar conceptual building blocks to the stack of Samza atop Kafka. Kinesis provides a partitioned view of messages as a stream of data and an API to have callbacks that execute when messages arrive. As with most AWS services, there is tight integration with other services, making it easy to get data into and out of locations such as S3.


Data Pipeline

The final AWS service that we'll mention is Data Pipeline, which can be found at http://aws.amazon.com/datapipeline. As the name suggests, it is a framework for building up data-processing jobs that involve multiple steps, data movements, and transformations. It has quite a conceptual overlap with Oozie, but with a few twists. Firstly, Data Pipeline has the expected deep integration with many other AWS services, enabling easy definition of data workflows that incorporate diverse repositories such as RDS, S3, and DynamoDB. In addition, however, Data Pipeline has the ability to integrate with agents installed on local infrastructure, providing an interesting avenue for building workflows that span the AWS and on-premises environments.


SourcesofinformationYoudon’tjustneednewtechnologiesandtools—eveniftheyarecool.Sometimes,alittlehelpfromamoreexperiencedsourcecanpullyououtofahole.Inthisregard,youarewellcovered,astheHadoopcommunityisextremelystronginmanyareas.


SourcecodeIt’ssometimeseasytooverlook,butHadoopandalltheotherApacheprojectsareafterallfullyopensource.Theactualsourcecodeistheultimatesource(pardonthepun)ofinformationabouthowthesystemworks.Becomingfamiliarwiththesourceandtracingthroughsomeofthefunctionalitycanbehugelyinformative.Nottomentionhelpfulwhenyouarehittingunexpectedbehavior.


Mailing lists and forums

Almost all the projects and services listed in this chapter have their own mailing lists and/or forums; check out the home pages for the specific links. Most distributions also have their own forums and other mechanisms to share knowledge and get (non-commercial) help from the community. Additionally, if using AWS, make sure to check out the AWS developer forums at https://forums.aws.amazon.com.

Always remember to read posting guidelines carefully and understand the expected etiquette. These are tremendous sources of information; the lists and forums are frequently visited by the developers of the particular project. Expect to see the core Hadoop developers on the Hadoop lists, Hive developers on the Hive list, EMR developers on the EMR forums, and so on.


LinkedIn groups

There are a number of Hadoop and related groups on the professional social network LinkedIn. Do a search for your particular areas of interest, but a good starting point might be the general Hadoop users' group at http://www.linkedin.com/groups/Hadoop-Users-988957.


HUGs

If you want more face-to-face interaction, then look for a Hadoop User Group (HUG) in your area, most of which will be listed at http://wiki.apache.org/hadoop/HadoopUserGroups. These tend to arrange semi-regular get-togethers that combine things such as quality presentations, the ability to discuss technology with like-minded individuals, and often pizza and drinks.

No HUG near where you live? Consider starting one.


Conferences

Though some industries take decades to build up a conference circuit, Hadoop already has some significant conference action involving the open source, academic, and commercial worlds. Events such as the Hadoop Summit and Strata are pretty big; these and some others are linked from http://wiki.apache.org/hadoop/Conferences.


Summary

In this chapter, we took a quick gallop around the broader Hadoop ecosystem, looking at the following topics:

- Why alternative Hadoop distributions exist, and some of the more popular ones
- Other projects that provide capabilities, extensions, or Hadoop supporting tools
- Alternative ways of writing or creating Hadoop jobs
- Sources of information and how to connect with other enthusiasts

Now, go have fun and build something amazing!


Index

A

additional data, collecting
  about / Collecting additional data
  workflows, scheduling / Scheduling workflows
  Oozie triggers / Other Oozie triggers
addMapper method, arguments
  job / Text cleanup using chain mapper
  class / Text cleanup using chain mapper
  inputKeyClass / Text cleanup using chain mapper
  inputValueClass / Text cleanup using chain mapper
  outputKeyClass / Text cleanup using chain mapper
  outputValueClass / Text cleanup using chain mapper
  mapperConf / Text cleanup using chain mapper
alternative distributions
  about / Alternative distributions
  Cloudera Distribution / Cloudera Distribution for Hadoop
  Hortonworks Data Platform (HDP) / Hortonworks Data Platform
  MapR / MapR
  selecting / Choosing a distribution
Amazon account
  reference link / Creating an AWS account
Amazon CLI
  reference link / The AWS command-line interface
Amazon EMR
  about / Amazon EMR
  AWS account, creating / Creating an AWS account
  required services, signing up / Signing up for the necessary services

Amazon Relational Database Service (Amazon RDS) / SimpleDB and DynamoDB
Amazon Web Services
  Hive, working with / Hive and Amazon Web Services
Ambari
  about / Ambari – the open source alternative
  URL / Ambari – the open source alternative
AMPLab at UC Berkeley
  URL / Apache Spark
Apache Avro
  about / Avro
  URL / Avro
Apache Crunch
  about / Apache Crunch
  URL / Apache Crunch
  JARs / Getting started
  libraries / Getting started
  concepts / Concepts
  PCollection<T> interface / Concepts
  PTable<Key, Value> interface / Concepts
  data serialization / Data serialization
  data processing patterns / Data processing patterns
  Pipelines implementation / Pipelines implementation and execution
  execution / Pipelines implementation and execution
  examples / Crunch examples
  Kite Morphlines / Kite Morphlines

Apache DataFu
  reference link / Contributed UDFs, Apache DataFu
  about / Apache DataFu
Apache Giraph
  about / Apache Giraph
  URL / Apache Giraph
Apache HAMA
  about / Apache HAMA
Apache Kafka
  URL / Apache Samza, Samza's best friend – Apache Kafka
  about / Samza's best friend – Apache Kafka
  Twitter data, getting into / Getting Twitter data into Kafka
Apache Knox
  about / Beyond basic authorization
  URL / Beyond basic authorization
Apache Sentry
  URL / Beyond basic authorization
Apache Spark
  about / Apache Spark, Getting started with Spark
  URL / Apache Spark, Getting started with Spark
  cluster computing, with working sets / Cluster computing with working sets
  Resilient Distributed Datasets (RDDs) / Resilient Distributed Datasets (RDDs)
  actions / Actions
  deployment / Deployment
  on YARN / Spark on YARN
  on EC2 / Spark on EC2
  standalone applications, writing / Writing and running standalone applications
  Scala API / Scala API
  Java API / Java API
  WordCount, in Java / WordCount in Java
  Python API / Python API
  data, processing / Processing data with Apache Spark

Apache Spark, ecosystem
  about / The Spark ecosystem
  Spark Streaming / Spark Streaming
  GraphX / GraphX
  MLlib / MLlib
  Spark SQL / Spark SQL
Apache Storm
  about / Apache Storm
  URL / Apache Storm
Apache Thrift
  about / Thrift
  URL / Thrift
Apache Tika
  about / Multijob workflows
  URL / Multijob workflows
Apache Twill
  URL / Thinking in layers
Apache ZooKeeper
  about / Apache ZooKeeper – a different type of filesystem
  URL / Apache ZooKeeper – a different type of filesystem
  distributed lock, implementing with sequential ZNodes / Implementing a distributed lock with sequential ZNodes
  group membership, implementing / Implementing group membership and leader election using ephemeral ZNodes
  leader election, implementing with ephemeral ZNodes / Implementing group membership and leader election using ephemeral ZNodes
  Java API / Java API
  blocks, building / Building blocks
  used, for enabling automatic NameNode failover / Automatic NameNode failover

application development
  framework, selecting / Choosing a framework
ApplicationManager
  about / ResourceManager, NodeManager, and ApplicationManager
ApplicationMaster (AM)
  about / Anatomy of a YARN application
architectural principles, HDFS and MapReduce / Common building blocks
Array wrapper classes
  about / Array wrapper classes
automatic NameNode failover
  enabling / Automatic NameNode failover
Avro
  about / Avro
Avro schema evolution, using
  thoughts / Final thoughts on using Avro schema evolution
  additive changes, making / Only make additive changes
  schema versions, managing explicitly / Manage schema versions explicitly
  schema distribution / Think about schema distribution

Avro schemas
  about / Using the Java API
AvroSerde
  URL / Avro
  about / Avro
AWS
  about / Distributions of Apache Hadoop, AWS – infrastructure on demand from Amazon
  Simple Storage Service (S3) / Simple Storage Service (S3)
  Elastic MapReduce (EMR) / Elastic MapReduce (EMR)
AWS command-line interface
  about / The AWS command-line interface
  reference link / The AWS command-line interface
AWS credentials
  about / AWS credentials
  account ID / AWS credentials
  access key / AWS credentials
  secret access key / AWS credentials
  key pairs / AWS credentials
  reference link / AWS credentials
AWS developer forums
  URL / Mailing lists and forums
AWS resources
  about / AWS resources
  SimpleDB / SimpleDB and DynamoDB
  DynamoDB / SimpleDB and DynamoDB
  Data Pipeline / Data Pipeline


B

block replication
  about / Block replication
Bulk Synchronous Parallel (BSP) model
  about / Apache Giraph


C

Cascading
  about / Cascading
  URL / Cascading
  reference links / Cascading
Cloudera
  URL / Distributions of Apache Hadoop
  URL, for documentation / Cloudera Manager
  URL, for blog post / Sharing resources
Cloudera distribution
  about / Cloudera Distribution for Hadoop
  URL / Cloudera Distribution for Hadoop
Cloudera Hadoop Distribution (CDH)
  about / Cloudera Manager
Cloudera Kitten
  URL / Thinking in layers
Cloudera Manager
  about / Cloudera Manager
  payment, for subscription services / To pay or not to pay
  cluster management, performing / Cluster management using Cloudera Manager
  integrating, with systems management tools / Cloudera Manager and other management tools
  monitoring with / Monitoring with Cloudera Manager
  log files, finding / Finding configuration files
Cloudera Manager API
  about / Cloudera Manager API
Cloudera Manager lock-in
  about / Cloudera Manager lock-in
Cloudera QuickStart VM
  about / Cloudera QuickStart VM
  advantages / Cloudera QuickStart VM
cluster
  building, on EMR / Building a cluster on EMR
cluster, Apache Spark
  computing, with working sets / Cluster computing with working sets
cluster, on EMR
  filesystem, considerations / Considerations about filesystems
  data, obtaining into EMR / Getting data into EMR
  EC2 instances / EC2 instances and tuning
  EC2 tuning / EC2 instances and tuning
cluster management
  performing, Cloudera Manager used / Cluster management using Cloudera Manager


cluster startup, HDFS
  about / Cluster startup
  NameNode startup / NameNode startup
  DataNode startup / DataNode startup
cluster tuning
  about / Cluster tuning
  JVM considerations / JVM considerations
  map optimization / Map and reduce optimizations
  reduce optimization / Map and reduce optimizations
column-oriented data formats
  about / Column-oriented data formats
  RCFile / RCFile
  ORC / ORC
  Parquet / Parquet
  Avro / Avro
  Java API, using / Using the Java API
columnar
  about / Columnar stores
columnar stores / Columnar stores
combiner class, Java API to MapReduce
  about / Combiner
combineValues operation
  about / Concepts
command-line access, HDFS filesystem
  about / Command-line access to the HDFS filesystem
  hdfs command / Command-line access to the HDFS filesystem
  dfs command / Command-line access to the HDFS filesystem
  dfsadmin command / Command-line access to the HDFS filesystem
Comparable interface
  about / The Comparable and WritableComparable interfaces
complex data types
  map / Pig data types
  tuple / Pig data types
  bag / Pig data types
complex event processing (CEP)
  about / How Samza works
components, Hadoop
  about / Components of Hadoop
  common building blocks / Common building blocks
  storage / Storage
  computation / Computation
components, YARN
  about / The components of YARN
  ResourceManager (RM) / The components of YARN
  NodeManager (NM) / The components of YARN
computation
  about / Computation
computation, Hadoop 2
  about / Computation in Hadoop 2
computational frameworks
  about / Other computational frameworks
  Apache Storm / Apache Storm
  Apache Giraph / Apache Giraph, Apache HAMA

conferences
  about / Conferences
  reference link / Conferences
configuration file, Samza
  about / The configuration file
containers
  about / Serialization and Containers
contributed UDFs
  about / Contributed UDFs
  Piggybank / Piggybank
  Elephant Bird / Elephant Bird
  Apache DataFu / Apache DataFu
create.hql script
  reference link / Extracting data and ingesting into Hive
Crunch examples
  about / Crunch examples
  word co-occurrence / Word co-occurrence
  TF-IDF / TF-IDF
Curator project
  reference link / Building blocks


D

data, managing: about / Managing and serializing data; Writable interface / The Writable interface; wrapper classes / Introducing the wrapper classes; Array wrapper classes / Array wrapper classes; Comparable interface / The Comparable and WritableComparable interfaces; WritableComparable interface / The Comparable and WritableComparable interfaces

data, Pig: working with / Working with data; FILTER operator / Filtering; aggregation / Aggregation; FOREACH operator / Foreach; JOIN operator / Join

data, storing: about / Storing data; serialization file format / Serialization and Containers; containers file format / Serialization and Containers; file compression / Compression; general-purpose file formats / General-purpose file formats; column-oriented data formats / Column-oriented data formats

Data Core: about / Data Core

Data Crunch: about / Data Crunch

Data HCatalog: about / Data HCatalog

Data Hive: about / Data Hive

data lifecycle management: about / What data lifecycle management is; importance / Importance of data lifecycle management; tools / Tools to help

Data MapReduce: about / Data MapReduce

DataNode / NameNode and DataNode

DataNodes: about / Storage in Hadoop 2

DataNode startup: about / DataNode startup

Data Pipeline: about / Data Pipeline; reference link / Data Pipeline

data processing: about / Data processing with Hadoop; dataset, generating from Twitter / Why Twitter?; dataset, building / Building our first dataset; programmatic access, with Python / Programmatic access with Python

data processing, Apache Spark: about / Processing data with Apache Spark; examples, running / Building and running the examples; examples, building / Building and running the examples; examples, running on YARN / Running the examples on YARN; popular topics, finding / Finding popular topics; sentiment, assigning to topics / Assigning a sentiment to topics; on streams / Data processing on streams; state management / State management; data analysis, with Spark SQL / Data analysis with Spark SQL; SQL, on data streams / SQL on data streams

data processing patterns, Crunch: about / Data processing patterns; aggregation and sorting / Aggregation and sorting; joining data / Joining data

data serialization, Crunch: about / Data serialization

dataset, building with Twitter: about / Building our first dataset; multiple APIs, using / One service, multiple APIs; anatomy, of Tweet / Anatomy of a Tweet; Twitter credentials / Twitter credentials

Data Spark: about / Data Spark

data types, Hive: numeric / Data types; date and time / Data types; string / Data types; collections / Data types; misc / Data types

data types, Pig: scalar data types / Pig data types; complex data types / Pig data types

DDL statements, Hive / DDL statements

decayFactor function / State management

DEFINE operator: about / Extending Pig (UDFs)

derived data, producing: about / Producing derived data; multiple actions, performing in parallel / Performing multiple actions in parallel; subworkflow, calling / Calling a subworkflow; global settings, adding / Adding global settings

DevOps practices / Hadoop and DevOps practices

directed acyclic graph (DAG): about / YARN

document frequency: about / Calculate document frequency; calculating, TF-IDF used / Calculate document frequency

Drill: URL / Drill, Tajo, and beyond; about / Drill, Tajo, and beyond

Driver class, Java API to MapReduce: about / The Driver class

dynamic invokers: about / Dynamic invokers; reference link / Dynamic invokers

DynamoDB: URL / SimpleDB and DynamoDB; about / SimpleDB and DynamoDB


E

EC2: Apache Spark on / Spark on EC2

EC2 key-value pair: reference link / The AWS command-line interface

Elastic MapReduce: Hive, using with / Hive on Elastic MapReduce

Elastic MapReduce (EMR): about / Distributions of Apache Hadoop, Elastic MapReduce (EMR); URL / Elastic MapReduce (EMR); using / Using Elastic MapReduce

Elephant Bird: reference link / Contributed UDFs, Elephant Bird

EMR: cluster, building on / Building a cluster on EMR; URL, for best practices / Building a cluster on EMR

EMR documentation: URL / Hive on Elastic MapReduce

entities: about / Tweet metadata

ephemeral ZNodes: about / Implementing group membership and leader election using ephemeral ZNodes

eval functions, Pig: AVG(expression) / Eval; COUNT(expression) / Eval; COUNT_STAR(expression) / Eval; IsEmpty(expression) / Eval; MAX(expression) / Eval; MIN(expression) / Eval; SUM(expression) / Eval; TOKENIZE(expression) / Eval

examples: running / Running the examples

examples, MapReduce programs: reference link / Running the examples; local cluster / Local cluster; Elastic MapReduce / Elastic MapReduce

examples and source code: download link / Getting started

ExecutionEngine interface / An overview of Pig

external data, challenges: about / Challenges of external data; data validation / Data validation; validation actions / Validation actions; format changes, handling / Handling format changes; schema evolution, handling with Avro / Handling schema evolution with Avro

EXTERNAL keyword / DDL statements

Extract-Transform-Load (ETL) / DDL statements

extract_for_hive.pig: URL, for source code / Prerequisites


F

Falcon: URL / Other tools to help; about / Other tools to help

file format, Hive: about / File formats and storage; JSON / JSON

FileFormat classes, Hive: TextInputFormat / File formats and storage; HiveIgnoreKeyTextOutputFormat / File formats and storage; SequenceFileInputFormat / File formats and storage; SequenceFileOutputFormat / File formats and storage

filesystem metadata, HDFS: protecting / Protecting the filesystem metadata; Secondary NameNode, demerits / Secondary NameNode not to the rescue; Hadoop 2 NameNode HA / Hadoop 2 NameNode HA; client configuration / Client configuration; failover, working / How a failover works

FILTER operator: about / Filtering

FlumeJava: reference link / Apache Crunch

FOREACH operator: about / Foreach

fork node: about / Performing multiple actions in parallel

functions, Pig: about / Pig functions; built-in functions / Pig functions; reference link, for built-in functions / Pig functions; load/store functions / Load/store; eval / Eval; tuple / The tuple, bag, and map functions; bag / The tuple, bag, and map functions; map / The tuple, bag, and map functions; string / The math, string, and datetime functions; math / The math, string, and datetime functions; datetime / The math, string, and datetime functions; dynamic invokers / Dynamic invokers; macros / Macros


G

Garbage Collection (GC) / JVM considerations

Garbage First (G1) collector / JVM considerations

general-purpose file formats: about / General-purpose file formats; Text files / General-purpose file formats; SequenceFile / General-purpose file formats

general availability (GA) / A note on versioning

Google Chubby system: reference link / Apache ZooKeeper – a different type of filesystem

Google File System (GFS): reference link / The background of Hadoop

Gradle: URL / Running the examples

GraphX: about / GraphX; URL / GraphX

groupByKey() method / Aggregation and sorting

groupByKey(GroupingOptions options) method / Aggregation and sorting

groupByKey(int numPartitions) method / Aggregation and sorting

groupByKey operation: about / Concepts

GROUP operator: about / Aggregation

Grunt: about / Grunt – the Pig interactive shell; sh command / Grunt – the Pig interactive shell; help command / Grunt – the Pig interactive shell

Guava library: URL / The TopN pattern


H

Hadoop: versioning / A note on versioning; background / The background of Hadoop; components / Components of Hadoop; dual approach / A dual approach; about / Getting started; using / Getting Hadoop up and running; EMR, using / How to use EMR; AWS credentials / AWS credentials; data processing / Data processing with Hadoop; practices / Hadoop and DevOps practices; alternative distributions / Alternative distributions; computational frameworks / Other computational frameworks; interesting projects / Other interesting projects; programming abstractions / Other programming abstractions; AWS resources / AWS resources; sources of information / Sources of information

Hadoop-provided InputFormat, MapReduce job: about / Hadoop-provided InputFormat; FileInputFormat / Hadoop-provided InputFormat; SequenceFileInputFormat / Hadoop-provided InputFormat; TextInputFormat / Hadoop-provided InputFormat; KeyValueTextInputFormat / Hadoop-provided InputFormat

Hadoop-provided Mapper and Reducer implementations, Java API to MapReduce: about / Hadoop-provided mapper and reducer implementations; mappers / Hadoop-provided mapper and reducer implementations; reducers / Hadoop-provided mapper and reducer implementations

Hadoop-provided OutputFormat, MapReduce job: about / Hadoop-provided OutputFormat; FileOutputFormat / Hadoop-provided OutputFormat; NullOutputFormat / Hadoop-provided OutputFormat; SequenceFileOutputFormat / Hadoop-provided OutputFormat; TextOutputFormat / Hadoop-provided OutputFormat

Hadoop-provided RecordReader, MapReduce job: about / Hadoop-provided RecordReader; LineRecordReader / Hadoop-provided RecordReader; SequenceFileRecordReader / Hadoop-provided RecordReader

Hadoop 2: about / Hadoop 2 – what’s the big deal?; storage / Storage in Hadoop 2; computation / Computation in Hadoop 2; diagrammatic representation, architecture / Computation in Hadoop 2; reference link / Getting started; operations / Operations in the Hadoop 2 world

Hadoop 2 NameNode HA: about / Hadoop 2 NameNode HA; enabling / Hadoop 2 NameNode HA; keeping, in sync / Keeping the HA NameNodes in sync

Hadoop Distributed File System (HDFS) / NameNode and DataNode

Hadoop distributions: about / Distributions of Apache Hadoop; Hortonworks / Distributions of Apache Hadoop; Cloudera / Distributions of Apache Hadoop; MapR / Distributions of Apache Hadoop; reference link / Distributions of Apache Hadoop

Hadoop filesystems: about / Hadoop filesystems; reference link / Hadoop filesystems; Hadoop interfaces / Hadoop interfaces

Hadoop interfaces: about / Hadoop interfaces; Java FileSystem API / Java FileSystem API; Libhdfs / Libhdfs; Apache Thrift / Thrift

Hadoop operations: about / I’m a developer – I don’t care about operations!

Hadoop security: future / The future of Hadoop security

Hadoop security model: evolution / Evolution of the Hadoop security model; additional security features / Beyond basic authorization

Hadoop streaming: about / Hadoop streaming; wordcount, streaming in Python / Streaming word count in Python; differences in jobs / Differences in jobs when using streaming; importance of words, determining / Finding important words in text

Hadoop UI: URL / Other tools to help; about / Other tools to help

Hadoop User Group (HUG) / HUGs

hashtagRegExp / Trending topics

hashtags: about / Sentiment of hashtags

HBase: about / HBase; URL / HBase


HCatalog: about / Introducing HCatalog; using / Using HCatalog

HCat CLI tool: about / Using HCatalog

hcat utility: about / Using HCatalog

HDFS: about / Components of Hadoop, Storage, Samza and HDFS; characteristics / Storage; architecture / The inner workings of HDFS; NameNode / The inner workings of HDFS; DataNodes / The inner workings of HDFS; cluster startup / Cluster startup; block replication / Block replication

HDFS and MapReduce: merits / Better together

HDFS filesystem: command-line access / Command-line access to the HDFS filesystem; exploring / Exploring the HDFS filesystem

HDFS snapshots: about / HDFS snapshots

Hello Samza: about / Hello Samza!; URL / Hello Samza!

high-availability (HA): about / Storage in Hadoop 2

High Performance Computing (HPC) / Computation in Hadoop 2

Hive: about / Hive-on-tez; URL / Hive-on-tez; overview / Overview of Hive; data types / Data types; DDL statements / DDL statements; file formats / File formats and storage; storage / File formats and storage; queries / Queries; scripts, writing / Writing scripts; working, with Amazon Web Services / Hive and Amazon Web Services; using, with S3 / Hive and S3; using, with Elastic MapReduce / Hive on Elastic MapReduce; URL, for source code of JDBC client / JDBC; URL, for source code of Thrift client / Thrift

Hive-JSON-Serde: URL / JSON

hive-json module: URL / JSON; about / JSON

Hive-on-tez: about / Hive-on-tez

Hive 0.13: about / Hive-on-tez

Hive architecture: about / Hive architecture

HiveQL: about / Why SQL on Hadoop, Queries; extending / Extending HiveQL

HiveServer2: about / Hive architecture; URL / Hive architecture

Hive tables: about / The nature of Hive tables; structuring, from workloads / Structuring Hive tables for given workloads

Hortonworks’ HDP: URL / Spark on YARN

Hortonworks: URL / Distributions of Apache Hadoop

Hortonworks Data Platform (HDP): about / Alternative distributions, Hortonworks Data Platform; URL / Hortonworks Data Platform

Hue: about / Hue; URL / Hue

HUGs: about / HUGs; reference link / HUGs


I

IAM console: URL / Hive and S3

IBM Infosphere BigInsights: about / And the rest…

Identity and Access Management (IAM) / AWS credentials

Impala: about / Impala; references / Impala, Co-existing with Hive; architecture / The architecture of Impala; co-existing, with Hive / Co-existing with Hive

in-sync replicas (ISR): about / Getting Twitter data into Kafka

indices attribute, entity: about / Tweet metadata

input/output, MapReduce job: about / Input/Output

InputFormat, MapReduce job: about / InputFormat and RecordReader


J

Java: WordCount / WordCount in Java

Java API: about / Java API; and Scala API, differences / Java API

Java API to MapReduce: about / Java API to MapReduce; Mapper class / The Mapper class; Reducer class / The Reducer class; Driver class / The Driver class; combiner class / Combiner; partitioning / Partitioning; Hadoop-provided Mapper and Reducer implementations / Hadoop-provided mapper and reducer implementations; reference data, sharing / Sharing reference data

Java FileSystem API: about / Java FileSystem API

JDBC: about / JDBC

JobTracker monitoring, MapReduce job: about / Ongoing JobTracker monitoring

join node: about / Performing multiple actions in parallel

JOIN operator: about / Join, Queries

JSON: about / JSON

JSON Simple: URL / Building a tweet parsing job

JVM considerations, cluster tuning: about / JVM considerations; small files problem / The small files problem


K

kite-morphlines-avro command / Morphline commands

kite-morphlines-core-stdio command / Morphline commands

kite-morphlines-core-stdlib command / Morphline commands

kite-morphlines-hadoop-core command / Morphline commands

kite-morphlines-hadoop-parquet-avro command / Morphline commands

kite-morphlines-hadoop-rcfile command / Morphline commands

kite-morphlines-hadoop-sequencefile command / Morphline commands

kite-morphlines-json command / Morphline commands

Kite Data: about / Kite Data; Data Core / Data Core; Data HCatalog / Data HCatalog; Data Hive / Data Hive; Data MapReduce / Data MapReduce; Data Spark / Data Spark; Data Crunch / Data Crunch

Kite examples: reference link / Kite Data

Kite JARs: reference link / Kite Data

Kite Morphlines: about / Kite Morphlines; concepts / Concepts; Record abstractions / Concepts; commands / Morphline commands

Kite SDK: URL / Kite Data

KVM: reference link / Cloudera QuickStart VM


L

Lambda syntax: URL / Python API

Libhdfs: about / Libhdfs

LinkedIn groups: about / LinkedIn groups; URL / LinkedIn groups

Log4j: about / Logging levels

log files: access to / Access to log files

logging levels: about / Logging levels


M

Machine Learning (ML): about / MLlib

macros: about / Macros

Mahout: about / Mahout; URL / Mahout

map optimization, cluster tuning: considerations / Map and reduce optimizations

Mapper class, Java API to MapReduce: about / The Mapper class

mapper execution, MapReduce job: about / Mapper execution

mapper input, MapReduce job: about / Mapper input

mapper output, MapReduce job: about / Mapper output and reducer input

mappers, Mapper and Reducer implementations: InverseMapper / Hadoop-provided mapper and reducer implementations; TokenCounterMapper / Hadoop-provided mapper and reducer implementations; IdentityMapper / Hadoop-provided mapper and reducer implementations

MapR: URL / Distributions of Apache Hadoop, MapR; about / MapR

MapReduce: reference link / The background of Hadoop, MapReduce; about / MapReduce; Map phase / MapReduce

MapReduce API: about / Components of Hadoop, Computation

MapReduce driver source code: reference link / Morphline commands

MapReduce job: about / Walking through a run of a MapReduce job; startup / Startup; input, splitting / Splitting the input; task assignment / Task assignment; task startup / Task startup; JobTracker monitoring / Ongoing JobTracker monitoring; mapper input / Mapper input; mapper execution / Mapper execution; mapper output / Mapper output and reducer input; reducer input / Reducer input; reducer execution / Reducer execution; reducer output / Reducer output; shutdown / Shutdown; input/output / Input/Output; InputFormat / InputFormat and RecordReader; RecordReader / InputFormat and RecordReader; Hadoop-provided InputFormat / Hadoop-provided InputFormat; Hadoop-provided RecordReader / Hadoop-provided RecordReader; OutputFormat / OutputFormat and RecordWriter; RecordWriter / OutputFormat and RecordWriter; Hadoop-provided OutputFormat / Hadoop-provided OutputFormat; sequence files / Sequence files

MapReduce programs: writing / Writing MapReduce programs, Getting started; examples, running / Running the examples; WordCount example / WordCount, the Hello World of MapReduce; word co-occurrences / Word co-occurrences; social network topics / Trending topics; reference link, for HashTagCount example source code / Trending topics; TopN pattern / The TopN pattern; reference link, for TopTenHashTag source code / The TopN pattern; hashtags / Sentiment of hashtags; reference link, for HashTagSentiment source code / Sentiment of hashtags; text cleanup, chain mapper used / Text cleanup using chain mapper; reference link, for HashTagSentimentChain source code / Text cleanup using chain mapper

Massively Parallel Processing (MPP): about / The architecture of Impala

MemPipeline: about / MemPipeline

Message Passing Interface (MPI) / Computation in Hadoop 2

MLlib: about / MLlib

monitoring: about / Monitoring; Hadoop / Hadoop – where failures don’t matter; application-level metrics / Application-level metrics

monitoring tools: about / Monitoring integration

MorphlineDriver source code: reference link / Morphline commands

Morphline commands: kite-morphlines-core-stdio / Morphline commands; kite-morphlines-core-stdlib / Morphline commands; kite-morphlines-avro / Morphline commands; kite-morphlines-json / Morphline commands; kite-morphlines-hadoop-parquet-avro / Morphline commands; kite-morphlines-hadoop-sequencefile / Morphline commands; kite-morphlines-hadoop-rcfile / Morphline commands; reference link / Morphline commands

MR ExecutionEngine / An overview of Pig

Multipart Upload: URL / Getting data into EMR


N

NameNode: about / Storage in Hadoop 2, NameNode and DataNode

NameNode HA: about / Storage in Hadoop 2

NameNode startup: about / NameNode startup

NFS share / Keeping the HA NameNodes in sync

NodeManager: about / ResourceManager, NodeManager, and ApplicationManager

NodeManager (NM): about / The components of YARN


O

Oozie: about / Introducing Oozie; URL / Introducing Oozie; features / Introducing Oozie; action nodes / Introducing Oozie; HDFS file permissions / A note on HDFS file permissions; development, making easier / Making development a little easier; data, extracting / Extracting data and ingesting into Hive; data, ingesting into Hive / Extracting data and ingesting into Hive; workflow directory structure / A note on workflow directory structure; HCatalog / Introducing HCatalog; sharelib / The Oozie sharelib; HCatalog and partitioned tables / HCatalog and partitioned tables; using / Pulling it all together

Oozie triggers / Other Oozie triggers

Oozie workflow: about / Introducing Oozie

operations, Hadoop 2: about / Operations in the Hadoop 2 world

opinion lexicon: URL / Sentiment of hashtags

Optimized Row Columnar file format (ORC): about / ORC; reference link / ORC

ORC: URL / Columnar stores

org.apache.zookeeper.ZooKeeper class: about / Java API

OutputFormat, MapReduce job: about / OutputFormat and RecordWriter


P

parallelDo operation: about / Concepts

PARALLEL operator: about / Aggregation

Parquet: reference link / Parquet; about / Parquet; URL / Columnar stores

partitioning, Java API to MapReduce: about / Partitioning; optional partition function / The optional partition function

PCollection<T> interface, Crunch: about / Concepts

physical cluster: building / Building a physical cluster

physical cluster, considerations: about / Physical layout; rack awareness / Rack awareness; service layout / Service layout; service, upgrading / Upgrading a service

Pig: overview / An overview of Pig; use cases / An overview of Pig; about / Getting started, Why SQL on Hadoop; running / Running Pig; reference link, for source code and binary distributions / Running Pig; Grunt / Grunt – the Pig interactive shell; Elastic MapReduce / Elastic MapReduce; fundamentals / Fundamentals of Apache Pig; reference link, for parallel feature / Fundamentals of Apache Pig; reference link, for multi-query implementation / Fundamentals of Apache Pig; programming / Programming Pig; data types / Pig data types; functions / Pig functions; data, working with / Working with data

Piggybank: about / Piggybank

Pig Latin / An overview of Pig

Pig UDFs: extending / Extending Pig (UDFs); contributed UDFs / Contributed UDFs

pipelines implementation, Apache Crunch: about / Pipelines implementation and execution; SparkPipeline / SparkPipeline; MemPipeline / MemPipeline

positive_words operator: about / Join

pre-requisites: about / Prerequisites

Predictive Model Markup Language (PMML) / Cascading

processing models, YARN: Cloudera Kitten / Thinking in layers; Apache Twill / Thinking in layers

programmatic interfaces: about / Programmatic interfaces; JDBC / JDBC; Thrift / Thrift

Project Rhino: URL / The future of Hadoop security

PTable<Key, Value> interface, Crunch: about / Concepts

Python: used, for programmatic access / Programmatic access with Python

Python API: about / Python API


Q

QJM mechanism: about / Keeping the HA NameNodes in sync

queries, Hive / Queries


R

RDDs: about / Cluster computing with working sets, Resilient Distributed Datasets (RDDs)

RDDs, operations: map / Actions; filter / Actions; reduce / Actions; collect / Actions; foreach / Actions; groupByKey / Actions; sortByKey / Actions

Record abstractions: implementing / Concepts

RecordReader, MapReduce job: about / InputFormat and RecordReader

RecordWriter, MapReduce job: about / OutputFormat and RecordWriter

Reduce function: about / MapReduce

reduce optimization, cluster tuning: considerations / Map and reduce optimizations

Reducer class, Java API to MapReduce: about / The Reducer class

reducer execution, MapReduce job: about / Reducer execution

reducer input, MapReduce job: about / Reducer input

reducer output, MapReduce job: about / Reducer output

reducers, Mapper and Reducer implementations: IntSumReducer / Hadoop-provided mapper and reducer implementations; LongSumReducer / Hadoop-provided mapper and reducer implementations; IdentityReducer / Hadoop-provided mapper and reducer implementations

reference data, Java API to MapReduce: sharing / Sharing reference data

REGISTER operator: about / Extending Pig (UDFs)

required services, AWS: Simple Storage Service (S3) / Signing up for the necessary services; Elastic MapReduce / Signing up for the necessary services; Elastic Compute Cloud (EC2) / Signing up for the necessary services

ResourceManager: about / ResourceManager, NodeManager, and ApplicationManager; applications / Applications; Nodes view / Nodes; Scheduler window / Scheduler; MapReduce / MapReduce; MapReduce v1 / MapReduce v1; MapReduce v2 (YARN) / MapReduce v2 (YARN); JobHistoryServer / JobHistoryServer

resources: sharing / Sharing resources

Role Based Access Control (RBAC) / Beyond basic authorization

Row Columnar File (RCFile): about / RCFile; reference link / RCFile


S

S3: Hive, using with / Hive and S3

s3distcp: URL / Getting data into EMR

s3n / Hadoop filesystems

Samza: about / Apache Samza; URL / Apache Samza, Stream processing with Samza; YARN-independent frameworks / YARN-independent frameworks; used, for stream processing / Stream processing with Samza; working / How Samza works; architecture / Samza high-level architecture; Apache Kafka / Samza’s best friend – Apache Kafka; integrating, with YARN / YARN integration; independent model / An independent model; Hello Samza / Hello Samza!; tweet parsing job, building / Building a tweet parsing job; configuration file / The configuration file; URL, for configuration options / The configuration file; Twitter data, getting into Apache Kafka / Getting Twitter data into Kafka; HDFS / Samza and HDFS; window function, adding / Windowing functions; multijob workflows / Multijob workflows; tweet sentiment analysis, performing / Tweet sentiment analysis; tasks processing / Stateful tasks; and Spark Streaming, comparing / Comparing Samza and Spark Streaming

Samza, layers: streaming / Samza high-level architecture; execution / Samza high-level architecture; processing / Samza high-level architecture

Samza job: executing / Running a Samza job

sbt: URL / Getting started with Spark

Scala and Java source code, examples: URL / Building and running the examples

Scala API: about / Scala API

scalar data types: int / Pig data types; long / Pig data types; float / Pig data types; double / Pig data types; chararray / Pig data types; bytearray / Pig data types; boolean / Pig data types; datetime / Pig data types; biginteger / Pig data types; bigdecimal / Pig data types

Scala source code: URL / Data processing on streams

Secondary NameNode: about / Secondary NameNode not to the rescue; demerits / Secondary NameNode not to the rescue

secured cluster: using, consequences / Consequences of using a secured cluster

security: about / Security

sentiment analysis: about / Sentiment of hashtags

SequenceFile: about / General-purpose file formats

SequenceFile class, MapReduce job: about / Sequence files

sequence files, MapReduce job: about / Sequence files; advantages / Sequence files

SerDe classes, Hive: MetadataTypedColumnsetSerDe / File formats and storage; ThriftSerDe / File formats and storage; DynamicSerDe / File formats and storage

serialization: about / Serialization and Containers

sharelib, Oozie: about / The Oozie sharelib

SimpleDB: about / SimpleDB and DynamoDB

Simple Storage Service (S3), AWS: about / Simple Storage Service (S3); URL / Simple Storage Service (S3)

sources of information, Hadoop: about / Sources of information; source code / Source code; mailing lists / Mailing lists and forums; forums / Mailing lists and forums; LinkedIn groups / LinkedIn groups; HUGs / HUGs; conferences / Conferences

Spark: about / Apache Spark; URL / Apache Spark

SparkContext object / Scala API

SparkPipeline: about / SparkPipeline

Spark SQL: about / Spark SQL; data analysis with / Data analysis with Spark SQL

Spark Streaming: URL / Spark Streaming; about / Spark Streaming; and Samza, comparing / Comparing Samza and Spark Streaming

specialized join: reference link / Join

speed of thought analysis / A different philosophy

SQL: on data streams / SQL on data streams; on data streams, URL / SQL on data streams

SQL-on-Hadoop: need for / Why SQL on Hadoop; solutions / Other SQL-on-Hadoop solutions

Sqoop: about / Sqoop; URL / Sqoop

Sqoop 1: about / Sqoop

Sqoop 2: about / Sqoop

standalone applications, Apache Spark: writing / Writing and running standalone applications; running / Writing and running standalone applications

statements: about / Fundamentals of Apache Pig

Stinger initiative: about / Stinger initiative

storage: about / Storage

storage, Hadoop 2: about / Storage in Hadoop 2

storage, Hive: about / File formats and storage; columnar stores / Columnar stores

Storm: URL / How Samza works; about / How Samza works

stream.py: reference link / Programmatic access with Python

stream processing: with Samza / Stream processing with Samza

streams: data, processing on / Data processing on streams

systems management tools: Cloudera Manager, integrating with / Cloudera Manager and other management tools


T

table partitioning: about / Partitioning a table; data, overwriting / Overwriting and updating data; data, updating / Overwriting and updating data; bucketing / Bucketing and sorting; sorting / Bucketing and sorting; data, sampling / Sampling data

Tajo: URL / Drill, Tajo, and beyond; about / Drill, Tajo, and beyond

tasks processing, Samza: about / Stateful tasks

term frequency: about / Calculate term frequency; calculating, with TF-IDF / Calculate term frequency

text attribute, entity: about / Tweet metadata

Text files: about / General-purpose file formats

Tez: about / Tez, An overview of Pig; URL / Tez, Stinger initiative; reference link, for canonical WordCount example / Tez; Hive-on-tez / Hive-on-tez

TF-IDF: about / Finding important words in text; definition / Finding important words in text; term frequency, calculating / Calculate term frequency; document frequency, calculating / Calculate document frequency; implementing / Putting it all together – TF-IDF

Thrift: about / Thrift

TOBAG(expression) function / The tuple, bag, and map functions

TOMAP(expression) function / The tuple, bag, and map functions

tools, data lifecycle management: orchestration services / Tools to help; connectors / Tools to help; file formats / Tools to help

TOP(n, column, relation) function / The tuple, bag, and map functions

TOTUPLE(expression) function / The tuple, bag, and map functions

troubleshooting: about / Troubleshooting

tuples: about / Fundamentals of Apache Pig

Tweet, structure: reference link / Anatomy of a Tweet

tweet analysis capability: building / Building a tweet analysis capability; tweet data, obtaining / Getting the tweet data; Oozie / Introducing Oozie; derived data, producing / Producing derived data

tweet sentiment analysis: performing / Tweet sentiment analysis; bootstrap streams / Bootstrap streams

Twitter: used, for generating dataset / Data processing with Hadoop; URL / Data processing with Hadoop; about / Why Twitter?; signup page / Twitter credentials; web form / Twitter credentials

Twitter data, properties: unstructured / Why Twitter?; structured / Why Twitter?; graph / Why Twitter?; geolocated / Why Twitter?; real time / Why Twitter?

Twitter Search: URL / Trending topics

Twitter stream: analyzing / Analyzing the Twitter stream; prerequisites / Prerequisites; dataset exploration / Dataset exploration; tweet metadata / Tweet metadata; data preparation / Data preparation; top n statistics / Top n statistics; datetime manipulation / Datetime manipulation; sessions / Sessions; users’ interaction, capturing / Capturing user interactions; link analysis / Link analysis; influential users, identifying / Influential users


U

union operation: about / Concepts

updateFunc function / State management

User Defined Aggregate Functions (UDAFs) / Extending HiveQL

User Defined Functions (UDFs): about / An overview of Pig, Extending HiveQL, Fundamentals of Apache Pig

User Defined Table Functions (UDTF) / Extending HiveQL


V

versioning, Hadoop: about / A note on versioning

VirtualBox: reference link / Cloudera QuickStart VM

VMware: reference link / Cloudera QuickStart VM


W

Whir: about / Whir; URL / Whir

Who to Follow service: reference link / Influential users

window function: adding / Windowing functions

WordCount: in Java / WordCount in Java

WordCount example, MapReduce programs: about / WordCount, the Hello World of MapReduce; reference link, for source code / Word co-occurrences

workflow-app: about / Introducing Oozie

workflow.xml file: reference link / Extracting data and ingesting into Hive

workflows: building, Oozie used / Pulling it all together

workloads: Hive tables, structuring from / Structuring Hive tables for given workloads

wrapper classes: about / Introducing the wrapper classes

WritableComparable interface: about / The Comparable and WritableComparable interfaces

Writable interface: about / The Writable interface


Y

YARN: about / Computation in Hadoop 2, YARN, YARN in the real world – Computation beyond MapReduce; architecture / YARN architecture; components / The components of YARN; processing frameworks / Thinking in layers; processing models / Thinking in layers; issues, with MapReduce / The problem with MapReduce; Tez / Tez; Apache Spark / Apache Spark; Apache Samza / Apache Samza; future / YARN today and beyond; present situation / YARN today and beyond; Samza, integrating / YARN integration; Apache Spark on / Spark on YARN; examples, running on / Running the examples on YARN; URL / Running the examples on YARN

YARN API: about / Thinking in layers

YARN application: anatomy / Anatomy of a YARN application; ApplicationMaster (AM) / Anatomy of a YARN application; lifecycle / Lifecycle of a YARN application; fault-tolerance / Fault tolerance and monitoring; monitoring / Fault tolerance and monitoring; execution models / Execution models


Z

ZooKeeperFailoverController (ZKFC) / Automatic NameNode failover

ZooKeeper quorum / Automatic NameNode failover