
Orchestrating Big Data with Apache Airflow
July 2016

Airflow allows developers, admins, and operations teams to author, schedule, and orchestrate workflows and jobs within an organization. While its main focus started with orchestrating data pipelines, its ability to work seamlessly outside of the Hadoop stack makes it a compelling solution for managing even traditional workloads. This paper discusses the architecture of Airflow as a big data platform and how it can help address these challenges to create stable data pipelines for enterprises.

6185 W DETROIT ST | CHANDLER, AZ 85226 | (623) 282-2385 | CLAIRVOYANTSOFT.COM | [email protected]


OVERVIEW

Data analytics is playing a key role in the decision-making process at various stages of business in many industries. Data is being generated at a very fast pace through various sources across the business. Applications that automate business processes are literally fountains of data today. Implementing solutions for use cases like "real-time data ingestion from various sources", "processing the data at different levels of the ingestion", and preparing the final data for analysis is a serious challenge given the dynamic nature of the data being generated. Proper orchestration, scheduling, management, and monitoring of the data pipelines is a critical task for any data platform to be stable and reliable. The dynamic nature of the data sources, data inflow rates, data schemas, processing needs, etc. makes workflow management (pipeline generation, maintenance, and monitoring) a challenge for any data platform.

This whitepaper provides a view on some of the open source tools available today. The paper also discusses the unique architecture of Airflow as a big data platform and how it can help address these challenges to create a stable data platform for enterprises. In addition, ingestion won't be halted at the first sign of trouble.

INTRODUCTION TO AIRFLOW

Airflow is a platform to programmatically author, schedule, and monitor data pipelines that meets the needs of almost all stages of the workflow-management lifecycle. Airbnb built the system on the four principles below:

• Dynamic: Airflow pipelines are configuration as code (Python), allowing for dynamic pipeline generation. This allows for writing code that instantiates pipelines dynamically.

• Extensible: Easily define your own operators and executors, and extend the library so that it fits the level of abstraction that suits your environment.

• Elegant: Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful Jinja templating engine.

• Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.

Basic concepts of Airflow

• DAGs: A Directed Acyclic Graph is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.

o DAGs are defined as Python scripts and are placed in the DAGs folder (this can be any location, but it needs to be configured in the Airflow config file).

o Once a new DAG is placed into the DAGs folder, it is picked up by Airflow automatically within a minute's time.

• Operators: An operator describes a single task in a workflow. While DAGs describe how to run a workflow, operators determine what actually gets done.

o Task: Once an operator is instantiated using some parameters, it is referred to as a "task".

o Task Instance: A specific run of a task at a point in time is called a task instance.

• Scheduling the DAGs/Tasks: The DAGs and tasks can be scheduled to run at a certain frequency using the parameters below.

o Schedule interval: Determines when the DAG should be triggered. This can be a cron expression or a Python timedelta object.

• Executors: Once the DAGs, tasks, and scheduling definitions are in place, something needs to execute the jobs/tasks. This is where executors come into the picture.

o There are three types of executors provided by Airflow out of the box.

o Sequential: The Sequential executor is meant for test drives; it executes tasks one by one (sequentially). Tasks cannot be parallelized.

o Local: The Local executor is like the Sequential executor, but it can parallelize task instances locally.

o Celery: The Celery executor relies on Celery, an open source distributed task execution engine based on message queues, making it more scalable and fault tolerant. Message queues like RabbitMQ or Redis can be used along with Celery. This executor is typically used for production purposes.
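To make these concepts concrete, here is a minimal sketch of a DAG definition as it might appear in a file in the DAGs folder. It targets the Airflow 1.x-era API that was current when this paper was written; the DAG name, commands, and dates are illustrative, not taken from the paper.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2016, 7, 1),   # first schedulable date
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# schedule_interval can be a cron expression, a preset like '@daily',
# or a timedelta
dag = DAG('example_pipeline', default_args=default_args,
          schedule_interval='@daily')

# Instantiating an operator with parameters yields a task
extract = BashOperator(task_id='extract', bash_command='echo extract',
                       dag=dag)
load = BashOperator(task_id='load', bash_command='echo load', dag=dag)

# 'load' runs only after 'extract' succeeds
extract.set_downstream(load)

Each scheduled run of 'extract' or 'load' is a task instance, which is what the executor actually hands to a worker.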

Airflow has an edge over other tools in the space

Below are some key features where Airflow has an upper hand over other tools like Luigi and Oozie:

• Pipelines are configured via code, making the pipelines dynamic

• A graphical representation of the DAG instances and task instances along with the metrics

• Scalability: distribution of workers and queues for task execution

• Hot deployment of DAGs/tasks

• Support for Celery, SLAs, and a great UI for monitoring metrics

• Support for calendar schedules and crontab scheduling

• Backfill: the ability to rerun a DAG instance for a past period, for example in case of a failure


• Variables for making changes to the DAGs/tasks quick and easy
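As a quick illustration of that last point, a DAG file can read a variable at parse time, so changing the variable in the UI (Admin > Variables) changes the pipeline's behavior without redeploying code. This is a sketch; the variable name is illustrative, not from the paper.

from airflow.models import Variable

# Returns the value stored under this key in the metadata database
target_table = Variable.get('target_table')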

ARCHITECTURE OF AIRFLOW

Airflow typically consists of the components below.

• Configuration file: All the configuration points are set here, like "which port to run the web server on", "which executor to use", "config related to RabbitMQ/Redis", the workers, the DAGs location, the repository, etc.

• Metadata database (MySQL or Postgres): The database where all the metadata related to the DAGs, DAG runs, tasks, and variables is stored.

• DAGs (Directed Acyclic Graphs): These are the workflow definitions (logical units) that contain the task definitions along with the dependency information. They are the actual jobs that the user would like to execute.

• Scheduler: A component that is responsible for triggering the DAG instances and job instances for each DAG. The scheduler is also responsible for invoking the executor (be it Local, Celery, or Sequential).

• Broker (Redis or RabbitMQ): In the case of a Celery executor, the broker is required to hold the messages and act as the communicator between the executor and the workers.

• Worker nodes: The actual workers that execute the tasks and return the results.

• Web server: A web server that renders the UI for Airflow, through which one can view the DAGs and their status, rerun them, create variables and connections, etc.

HOW IT WORKS

• Initially, the primary (instance 1) and standby (instance 2) schedulers would be up and running. Instance 1 would be declared as the primary Airflow server in the MySQL table.

• The DAGs folder of the primary instance (instance 1) would contain the actual DAGs, and the DAGs folder of the standby instance (instance 2) would contain the counterpart poller (PrimaryServerPoller).

o The primary server would be scheduling the actual DAGs as required.

o The standby server would be running the PrimaryServerPoller, which would continuously poll the primary Airflow scheduler.

• Let's assume the primary server has gone down. In that case, the PrimaryServerPoller would detect this and:

o Declare itself as the primary Airflow server in the MySQL table.

o Move the actual DAGs out of the DAGs folder and move the PrimaryServerPoller DAG into the DAGs folder on the old primary server (instance 1).

o Move the actual DAGs into the DAGs folder and move the PrimaryServerPoller DAG out of the DAGs folder on the old standby (instance 2).

• So here, the primary and standby servers have swapped their positions.

• Even if the Airflow scheduler on the current standby server (instance 1) comes back up, there would be no harm, since only the PrimaryServerPoller DAG would be running on it. This server (instance 1) would remain the standby until the current primary server (instance 2) goes down.

• In case the current primary server goes down, the same process would repeat, and the Airflow instance running on instance 1 would become the primary server.
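The paper does not include the poller's code, so the following is a minimal, hypothetical sketch of the failover steps described above. Every name in it (the flag table, hosts, paths, the liveness check) is an illustrative assumption, not part of Airflow or of this paper.

import os
import shutil
import socket

import MySQLdb  # assumes the MySQL-Python client library is installed

DAGS_FOLDER = '/root/airflow/DAGS'
REAL_DAGS = '/root/airflow/real_dags'   # the actual pipeline definitions
POLLER_DAG = '/root/airflow/poller/primary_server_poller.py'

def primary_is_alive(host, port=8080):
    # Crude liveness probe: try to reach the primary's web server port.
    try:
        socket.create_connection((host, port), timeout=5).close()
        return True
    except (socket.error, socket.timeout):
        return False

def take_over(my_name):
    # 1. Declare this instance the primary via a flag in a MySQL table.
    conn = MySQLdb.connect(host='mysql-host', user='airflow',
                           passwd='airflow', db='airflow')
    cur = conn.cursor()
    cur.execute("UPDATE airflow_primary SET is_primary = (name = %s)",
                (my_name,))
    conn.commit()
    conn.close()
    # 2. Swap folder contents: park the poller DAG, copy the real DAGs in.
    shutil.move(POLLER_DAG, '/root/airflow/poller_parked.py')
    for dag_file in os.listdir(REAL_DAGS):
        shutil.copy(os.path.join(REAL_DAGS, dag_file), DAGS_FOLDER)

if not primary_is_alive('primary-host'):
    take_over('instance2')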

DEPLOYMENT VIEWS

Based on the needs, one may have to go with a simple or a complex setup of Airflow. There are different ways Airflow can be deployed (especially from an executor point of view). Below are the deployment options along with a description of each.

Standalone mode of deployment

Description: As mentioned in the section above, a typical installation of Airflow consists of the following.

• Configuration file (airflow.cfg): Contains the details of where to pick the DAGs from, which executor to run, how frequently the scheduler should poll the DAGs folder for new definitions, which port to start the web server on, etc.

• Metadata repository: Typically, a MySQL or Postgres database is used for this purpose. All the metadata related to the DAGs, their metrics, the tasks and their statuses, SLAs, variables, etc. is stored here.


• Web server: Renders the UI that shows all the DAGs and their current states, along with the metrics (which are pulled from the repository).

• Scheduler: Reads the DAGs and puts the details about them into the repository. It also initiates the executor.

• Executor: Responsible for reading the schedule-interval info and creating the instances of the DAGs and tasks in the repository.

• Worker: Reads the task instances, performs the tasks, and writes the status back to the repository.

Distributed mode of deployment

Description: The descriptions of most of the components mentioned in the standalone section remain the same, except for the executor and the workers.

• RabbitMQ: RabbitMQ is the distributed messaging service that the Celery executor leverages to put the task instances into; this is where the workers typically read the tasks for execution. Basically, RabbitMQ exposes a broker URL for the Celery executor and the workers to talk to.

• Executor: Here the executor would be the Celery executor (configured in airflow.cfg). The Celery executor is configured to point to the RabbitMQ broker.

• Workers: The workers are installed on different nodes (based on the requirement) and are configured to read the task info from the RabbitMQ broker. The workers are also configured with a result backend (celery_result_backend), which can typically be the MySQL repository itself.

The important point to be noted here is:

The worker nodes are nothing but Airflow installations themselves. The DAG definitions should be in sync on all the nodes (both the primary Airflow installation and the worker nodes).
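For example, the executor-related entries in airflow.cfg on each node might look like the following; the host names and credentials here are placeholders, not values from the paper.

executor = CeleryExecutor
broker_url = amqp://guest:guest@rabbitmq-host:5672//
celery_result_backend = db+mysql://airflow:airflow@mysql-host:3306/airflow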

Distributed mode of deployment with High Availability setup

Description: For a highly available Airflow installation, we assume that the MySQL repository and RabbitMQ are themselves configured to be highly available. The focus here is on how to make Airflow components like the web server and the scheduler highly available.

The description for most of the components would remain the same as above. Below is what changes:

• New Airflow instance (standby): There would be another instance of Airflow set up as a standby.

o The components shown in green are the primary Airflow instance.

o The one in red is the standby instance.

• A new DAG must be put in place, called the "CounterpartPoller". The purpose of this DAG is two-fold:

o To continuously poll the counterpart scheduler to see if it is up and running.

o If the counterpart instance is not reachable (which means the instance is down):

• Declare the current Airflow instance as the primary.

• Move the DAGs of the (previous) primary instance out of its DAGs folder and move the CounterpartPoller DAG into that DAGs folder.

• Move the actual DAGs into the DAGs folder and move the CounterpartPoller DAG out of the DAGs folder on the standby server (the one that now declares itself primary).

Note that an instance can declare itself the primary server by setting a flag in a MySQL table.

TYPICAL STAGES OF WORKFLOW MANAGEMENT

The typical stages of the lifecycle of workflow management for big data are as follows:

• Create jobs to interact with systems that operate on data

o Use of tools/products like Hive, Presto, HDFS, Postgres, S3, etc.

• (Dynamic) workflow creation

o Based on the number of sources, size of data, business logic, variety of data, changes in the schema, and the list goes on.

• Manage dependencies between operations

o Upstream, downstream, cross-pipeline dependencies, previous job state, etc.

• Schedule the jobs/operations

o Calendar schedule, event-driven, cron expression, etc.

• Keep track of the operations and the metrics of the workflow

o Monitor the current/historic state of the jobs, the results of the jobs, etc.

• Ensure fault tolerance of the pipelines and the capability to backfill any missing data, etc.

This list grows as the complexity increases.

TOOLS THAT SOLVE WORKFLOW MANAGEMENT

There are many workflow management tools in the market. Some have support for big data operations out of the box, and some need extensive customization/extensions to support big data pipelines.

• Oozie: Oozie is a workflow scheduler system to manage Apache Hadoop jobs

• BigDataScript: BigDataScript is intended as a scripting language for big data pipelines

• Makeflow: Makeflow is a workflow engine for executing large, complex workflows on clusters and grids

• Luigi: Luigi is a Python module that helps you build complex pipelines of batch jobs (this is a strong competitor to Airflow)

• Airflow: Airflow is a platform to programmatically author, schedule, and monitor workflows

• Azkaban: Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs

• Pinball: Pinball is a scalable workflow manager developed at Pinterest

Most of the mentioned tools meet the basic needs of workflow management. When it comes to dealing with complex workflows, only a few of the above shine. Luigi, Airflow, Oozie, and Pinball are the tools preferred (and used in production) by most teams across the industry.

None of the existing resources (on the web) talk about the architecture and the setup of Airflow in production with the Celery executor, and more importantly about how Airflow needs to be configured to be highly available. Hence, here is an attempt to share that missing information.

INSTALLATION STEPS FOR AIRFLOW AND HOW IT WORKS

Install pip

Installation steps

sudo yum install epel-release
sudo yum install python-pip python-wheel

Install Erlang

Installation steps

sudo yum install wxGTK
sudo yum install erlang

RabbitMQ

Installation steps

wget https://www.rabbitmq.com/releases/rabbitmq-server/v3.6.2/rabbitmq-server-3.6.2-1.noarch.rpm
sudo yum install rabbitmq-server-3.6.2-1.noarch.rpm

Celery

Installation steps

pip install celery

Airflow: Pre-requisites

Installation steps

sudo yum install gcc-gfortran libgfortran numpy redhat-rpm-config python-devel gcc-c++

Airflow

Installation steps


# create a home directory for airflow
mkdir ~/airflow

# export the location to the AIRFLOW_HOME variable
export AIRFLOW_HOME=~/airflow

pip install airflow

Initialize the Airflow Database

Installation steps

airflow initdb

By default, Airflow installs with a SQLite database. The step above creates an airflow.cfg file within the $AIRFLOW_HOME/ directory. Once this is done, you may want to change the repository database to a well-known (highly available) relational database like MySQL or Postgres, and then reinitialize the database (using the airflow initdb command). That creates all the required tables for Airflow in the relational database.
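For example, assuming a MySQL database named airflow has already been created for this purpose (the host name and credentials below are placeholders), the switch looks like this:

# in $AIRFLOW_HOME/airflow.cfg
sql_alchemy_conn = mysql://airflow:airflow@mysql-host:3306/airflow

# then recreate the metadata tables in the new database
airflow initdb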

Start the Airflow components

Installation steps

# Start the Scheduler
airflow scheduler

# Start the Webserver
airflow webserver

# Start the Worker
airflow worker

Here are a few important configuration points in the airflow.cfg file:

• dags_folder = /root/airflow/DAGS

o The folder where your Airflow pipelines live.

• executor = LocalExecutor

o The executor class that Airflow should use.

• sql_alchemy_conn = mysql://root:root@localhost/airflow

o The SqlAlchemy connection string to the metadata database.

• base_url = http://localhost:8080

o The hostname and port at which the Airflow web server runs.

• broker_url = sqla+mysql://root:root@localhost:3306/airflow

o The Celery broker URL. Celery supports RabbitMQ, Redis, and (experimentally) a SQLAlchemy database.

• celery_result_backend = db+mysql://root:root@localhost:3306/airflow

o A key Celery setting that determines where the workers write their results.

This should give you a running Airflow instance and set you on the path to running it in production.

If you have any questions or feedback, we would love to hear from you; do drop us a line at [email protected]. We use Airflow extensively in production and would love to hear more about your needs and thoughts.

ABOUT CLAIRVOYANT

Clairvoyant is a global technology consulting and services company. We help organizations build innovative products and solutions using big data, analytics, and the cloud. We provide best-in-class solutions and services that leverage big data and continually exceed client expectations. Our deep vertical knowledge, combined with expertise on multiple, enterprise-grade big data platforms, helps support purpose-built solutions to meet our clients' business needs.