Everything You want to know about oozie
-
7/22/2019 Everything You want to know about oozie
Everything that you ever wanted to
know about Oozie, but were afraid
to ask
B Lublinsky, A Yakubovich
-
Apache Oozie
Oozie is a workflow/coordination system to manage Apache Hadoop jobs.
A single Oozie server implements all four functional Oozie components:
  Oozie workflow
  Oozie coordinator
  Oozie bundle
  Oozie SLA
-
Main components
[Diagram: Oozie main components. The Oozie command-line interface and 3rd-party applications reach the Oozie server through the WS API for job submission and monitoring. Inside the server, bundles group coordinators, coordinators perform time and data condition monitoring and trigger workflows, and workflows contain the workflow logic and actions. The server stores definitions and states as data, uses the Oozie shared libraries, and runs actions on Hadoop (HDFS, MapReduce).]
-
Oozie workflow
-
Workflow language

Flow-control node | XML element type | Description
Decision | workflow:DECISION | expresses switch-case logic
Fork | workflow:FORK | splits one path of execution into multiple concurrent paths
Join | workflow:JOIN | waits until every concurrent execution path of a previous fork node arrives at it
Kill | workflow:kill | forces a workflow job to kill (abort) itself

Action node | XML element type | Description
java | workflow:JAVA | invokes the main() method of the specified Java class
fs | workflow:FS | manipulates files and directories in HDFS; supports the commands move, delete, mkdir
MapReduce | workflow:MAP-REDUCE | starts a Hadoop map/reduce job, which can be a Java MR job, a streaming job or a pipes job
Pig | workflow:pig | runs a Pig job
Sub workflow | workflow:SUB-WORKFLOW | runs a child workflow job
Hive * | workflow:HIVE | runs a Hive job
Shell * | workflow:SHELL | runs a shell command
ssh * | workflow:SSH | starts a shell command on a remote machine as a remote secure shell
Sqoop * | workflow:SQOOP | runs a Sqoop job
Email * | workflow:EMAIL | sends emails from an Oozie workflow application
Distcp ? | | under development (Yahoo)
-
Workflow actions
Sequence diagram participants: ActionStartCommand, JavaActionExecutor, WorkflowStore, Services, JobClient, ActionExecutorContext
1 : workflow := getWorkflow()
2 : action := getAction()
3 : context := init()
4 : executor := get()
5 : start()
6 : submitLauncher()
7 : jobClient := get()
8 : runningJob := submit()
9 : setStartData()
Oozie workflow supports two types of actions:
  Synchronous, executed inside the Oozie runtime
  Asynchronous, executed as a MapReduce job
-
Workflow lifecycle
[State diagram: workflow job states PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED]
-
Oozie execution console
-
Extending Oozie workflow
Oozie provides a minimal workflow language, which contains only a handful of control and action nodes.
Oozie supports a very elegant extensibility mechanism: custom action nodes. Custom action nodes allow extending the Oozie language with additional actions (verbs).
Creating a custom action requires the following:
  A Java action implementation, which extends the ActionExecutor class
  An XML schema for the action, defining the action's configuration parameters
  Packaging of the Java implementation and configuration schema into an action jar, which has to be added to the Oozie war
  Extending oozie-site.xml to register information about the custom executor with the Oozie runtime
-
Oozie Workflow Client
Oozie provides an easy way to integrate with enterprise applications through Oozie client APIs. It provides two types of APIs:
REST HTTP API
  A number of HTTP requests:
    Info requests (job status, job configuration)
    Job management (submit, start, suspend, resume, kill)
  Example: job definition info request
    GET /oozie/v0/job/job-ID?show=definition
Java API - package org.apache.oozie.client
  OozieClient
    start(), submit(), run(), reRunXXX(), resume(), kill(), suspend()
  WorkflowJob, WorkflowAction
  CoordinatorJob, CoordinatorAction
  SLAEvent
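
The job definition info request can be sketched as a small URL builder. This is only an illustration: the class and method names are hypothetical, and the base URL in main (localhost, port 11000) is an assumption about where the Oozie server runs. Only the /v0/job/job-ID?show=definition path comes from the slide.

```java
import java.net.URI;

/** Hypothetical helper that builds the job definition info request URL. */
final class OozieRestUrls {

    private OozieRestUrls() {
    }

    /**
     * @param oozieBaseUrl base URL of the Oozie web application (assumed, e.g. http://localhost:11000/oozie)
     * @param jobId        the Oozie job ID to query
     * @return the URI for GET /oozie/v0/job/job-ID?show=definition
     */
    static URI jobDefinition(String oozieBaseUrl, String jobId) {
        // Drop a trailing slash so the path separator is not doubled
        String base = oozieBaseUrl.endsWith("/")
                ? oozieBaseUrl.substring(0, oozieBaseUrl.length() - 1)
                : oozieBaseUrl;
        return URI.create(base + "/v0/job/" + jobId + "?show=definition");
    }

    public static void main(String[] args) {
        // "job-ID" is the placeholder used on the slide, not a real job ID
        System.out.println(jobDefinition("http://localhost:11000/oozie", "job-ID"));
    }
}
```

The resulting URI can be passed to any HTTP client to issue the GET request.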
-
Oozie workflow: good, bad and ugly
Good
  Nice integration with the Hadoop ecosystem, making it easy to build processes encompassing synchronized execution of multiple MapReduce, Hive, Pig, etc. jobs
  Nice UI for tracking execution progress
  Simple APIs for integration with other applications
  Simple extensibility APIs
Bad
  Process has to be expressed directly in hPDL with no visual support
  No support for uber jars (but we added our own)
Ugly
  Static forking (but you can regenerate the workflow and invoke it on the fly)
  No support for loops
-
Oozie Coordinator
-
Coordinator language

Element type | Description | Attributes and sub-elements
coordinator-app | top-level element in a coordinator instance | frequency, start, end
controls | specifies the execution policy for the coordinator and its elements (workflow actions) | timeout (actions), concurrency (actions), execution order (workflow instances)
action | required singular element specifying the associated workflow; the jobs specified in the workflow consume and produce dataset instances | workflow name
datasets | collection of data referred to by a logical name; datasets serve to specify data dependencies between workflow instances |
input event | specifies the input conditions (in the form of present datasets) that are required in order to execute a coordinator action |
output event | specifies the dataset that should be produced by a coordinator action |
-
Coordinator lifecycle
-
Oozie Bundle
-
Bundle lifecycle
[State diagram: bundle job states PREP, PREPSUSPENDED, PREPPAUSED, RUNNING, SUSPENDED, PAUSED, SUCCEEDED, KILLED, FAILED]
-
Oozie SLA
-
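SLA navigation
[Diagram: database tables used to navigate from SLA events to coordinator and workflow jobs and actions]
SLA_EVENT: event_id, alert_contact, alert_frequency, sla_id, ...
COORD_JOBS: id, app_name, app_path
COORD_ACTIONS: id, action_number, action_xml, external_id, ...
WF_ACTIONS: id, conf, console_url
WF_JOBS: id, app_name, app_path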
-
Using Probes to analyze/monitor Places
Select probe data for specified time/location
Validate - Filter - Transform probe data
Calculate statistics on available probe data
Distribute data per geo-tiles
Calculate place statistics (e.g. attendance index)
-------------------------------------------------------------
If exception condition happens, report failure
If all steps succeeded, report success
-
Workflow as acyclic graph
-
Workflow fragment 1
-
Workflow fragment 2
-
Oozie tips and tricks
-
Configuring workflow
Oozie provides three overlapping mechanisms to configure a workflow: config-default.xml, the job properties file, and job arguments that can be passed to Oozie as part of command-line invocations.
Oozie processes these three sets of parameters as follows:
  Use all of the parameters from the command-line invocation
  For remaining unresolved parameters, use the job config (the job properties file)
  Use config-default.xml for everything else
Although the documentation does not describe clearly when to use which, the overall recommendation is as follows:
  Use config-default.xml for parameters that never change for a given workflow
  Use job properties for parameters that are common to a given deployment of a workflow
  Use command-line arguments for parameters that are specific to a given workflow invocation
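
The precedence order above can be illustrated with chained java.util.Properties objects. This is a sketch of the lookup order only, not Oozie's own implementation; the class name, method name and sample keys are hypothetical.

```java
import java.util.Properties;

/** Hypothetical illustration of the three-level parameter precedence. */
final class WorkflowParameters {

    private WorkflowParameters() {
    }

    /**
     * Chains the three sources so that getProperty() falls through in precedence order:
     * command line first, then job properties, then config-default.xml.
     */
    static Properties resolve(Properties configDefault, Properties jobProperties, Properties commandLine) {
        // Lowest precedence: values from config-default.xml
        Properties lowest = new Properties();
        lowest.putAll(configDefault);
        // Middle: job properties, with config-default.xml as fallback
        Properties middle = new Properties(lowest);
        middle.putAll(jobProperties);
        // Highest: command-line arguments, with the two levels above as fallback
        Properties resolved = new Properties(middle);
        resolved.putAll(commandLine);
        return resolved;
    }

    public static void main(String[] args) {
        Properties configDefault = new Properties();
        configDefault.setProperty("queueName", "default");
        Properties jobProperties = new Properties();
        jobProperties.setProperty("queueName", "etl");
        Properties commandLine = new Properties();
        commandLine.setProperty("inputDir", "/data/in");

        Properties resolved = resolve(configDefault, jobProperties, commandLine);
        // Use getProperty(): get() does not consult the chained defaults
        System.out.println(resolved.getProperty("queueName"));
        System.out.println(resolved.getProperty("inputDir"));
    }
}
```

In this example the job properties value of queueName wins over the one from config-default.xml, because no command-line value overrides it.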
-
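Accessing and storing process variables
Accessing
  Through the arguments in Java main
Storing
// Requires java.io.File, java.io.FileOutputStream, java.io.OutputStream, java.util.Properties
// Oozie passes the location of the output file in this system property
String ooziePropFileName = System.getProperty("oozie.action.output.properties");
Properties props = new Properties();
props.setProperty(key, value);
// try-with-resources closes the stream even if store() fails
try (OutputStream os = new FileOutputStream(new File(ooziePropFileName))) {
    props.store(os, "");
}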
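
The storing step can be wrapped in a small self-contained class, sketched below. The class name, the sample key and the temp-file fallback in main are assumptions for illustration, so that the example also runs outside Oozie. Note that the java action has to declare capture-output in the workflow definition for Oozie to pick up the stored values.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper that stores process variables for a java action. */
final class ActionOutputWriter {

    /** System property in which Oozie passes the output file location. */
    static final String OUTPUT_PROPERTY = "oozie.action.output.properties";

    private ActionOutputWriter() {
    }

    /** Writes the given key/value pairs to the file named by the system property. */
    static File write(Map<String, String> values) throws IOException {
        String fileName = System.getProperty(OUTPUT_PROPERTY);
        if (fileName == null) {
            throw new IllegalStateException(OUTPUT_PROPERTY + " is not set");
        }
        Properties props = new Properties();
        values.forEach(props::setProperty);
        File file = new File(fileName);
        // try-with-resources closes the stream even if store() fails
        try (OutputStream os = new FileOutputStream(file)) {
            props.store(os, "");
        }
        return file;
    }

    public static void main(String[] args) throws IOException {
        // Outside Oozie the property is absent; point it at a temp file (illustration only)
        if (System.getProperty(OUTPUT_PROPERTY) == null) {
            File tmp = File.createTempFile("oozie-output", ".properties");
            tmp.deleteOnExit();
            System.setProperty(OUTPUT_PROPERTY, tmp.getAbsolutePath());
        }
        File written = write(Map.of("recordCount", "42"));
        System.out.println("Wrote " + written.getAbsolutePath());
    }
}
```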
-
Validating data presence
Oozie provides two possible approaches for validating the presence of resource file(s):
  Using Oozie coordinator input events based on the dataset: technically the simplest implementation approach, but it does not provide the more complex decision support that might be required. It just either runs the corresponding workflow or not.
  Using a custom Java node inside the Oozie workflow: allows extending the decision logic by sending notifications about data absence, running execution on partial data under certain timing conditions, etc.
Additional configuration parameters for the Oozie coordinator, for example the ability to wait for file arrival, can expand the usage of the Oozie coordinator.
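
The kind of decision logic a custom Java node could add is sketched below. Everything in it is hypothetical: the class, the decision names and the policy itself (wait until a deadline, then run on partial data or report absence) are one possible reading of the second approach, not the presenters' implementation. A real node would list the files through the Hadoop FileSystem API instead of receiving them as a set.

```java
import java.time.Duration;
import java.util.Set;

/** Hypothetical decision policy for a data-presence check inside a workflow. */
final class DataPresencePolicy {

    enum Decision { RUN_FULL, RUN_PARTIAL, WAIT, NOTIFY_ABSENT }

    private DataPresencePolicy() {
    }

    /**
     * @param expected files the workflow needs
     * @param present  files that currently exist
     * @param waited   how long the process has been waiting for data
     * @param maxWait  how long it is allowed to wait before deciding
     */
    static Decision decide(Set<String> expected, Set<String> present, Duration waited, Duration maxWait) {
        long found = expected.stream().filter(present::contains).count();
        if (found == expected.size()) {
            return Decision.RUN_FULL;          // everything arrived
        }
        if (waited.compareTo(maxWait) < 0) {
            return Decision.WAIT;              // still within the waiting window
        }
        // Deadline passed: run on what is there, or report that nothing arrived
        return found > 0 ? Decision.RUN_PARTIAL : Decision.NOTIFY_ABSENT;
    }

    public static void main(String[] args) {
        Set<String> expected = Set.of("part-0", "part-1");
        Set<String> present = Set.of("part-0");
        System.out.println(decide(expected, present, Duration.ofMinutes(90), Duration.ofMinutes(60)));
    }
}
```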
-
Invoking MapReduce jobs
Oozie provides two different ways of invoking a MapReduce job: the MapReduce action and the java action.
Invoking a MapReduce job with the java action is somewhat similar to invoking the job with the Hadoop command line from an edge node. You specify a driver as the class for the java action and Oozie invokes the driver. This approach has two main advantages:
  The same driver class can be used both for running the MapReduce job from an edge node and for a java action in an Oozie process.
  A driver provides a convenient place for executing additional code, for example clean-up required for MapReduce execution.
The driver requires a proper shutdown hook to ensure that there are no lingering MapReduce jobs.
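
A minimal sketch of such a shutdown hook follows. The KillableJob interface is a hypothetical stand-in for a handle to the submitted MapReduce job (in a real driver the hook would ask Hadoop to kill the job); the class and method names are assumptions as well.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical shutdown hook that kills a still-running job when the driver JVM exits. */
final class DriverShutdownHook {

    /** Stand-in for a handle to the submitted MapReduce job (assumption). */
    interface KillableJob {
        void kill();
    }

    private final AtomicBoolean completed = new AtomicBoolean(false);
    private final Thread hook;

    DriverShutdownHook(KillableJob job) {
        // The hook only kills the job if the driver did not see it complete
        this.hook = new Thread(() -> {
            if (!completed.get()) {
                job.kill();
            }
        }, "mr-driver-shutdown-hook");
    }

    /** Registers the hook with the JVM; call before submitting the job. */
    void register() {
        Runtime.getRuntime().addShutdownHook(hook);
    }

    /** Call once the job has finished so that the hook becomes a no-op. */
    void markCompleted() {
        completed.set(true);
    }

    /** Exposes the hook thread, for example for testing. */
    Thread hookThread() {
        return hook;
    }

    public static void main(String[] args) {
        DriverShutdownHook shutdownHook =
                new DriverShutdownHook(() -> System.out.println("Killing lingering job"));
        shutdownHook.register();
        // ... submit the MapReduce job and wait for it to complete ...
        shutdownHook.markCompleted();
    }
}
```

If the driver is killed while the job is still running, the JVM runs the hook and the job is killed with it; after markCompleted() the hook does nothing.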
-
Implementing predefined looping and forking
hPDL is an XML document with a well-defined schema.
This means that the actual workflow can be easily manipulated using JAXB objects, which can be generated from the hPDL schema using the xjc compiler.
This means that we can create the complete workflow programmatically, based on a calculated number of fork branches, or implement loops as repeated actions.
The other option is creating a template process and modifying it based on calculated parameters.
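
Generating a workflow with a calculated number of fork branches can be sketched as follows. To stay self-contained, this example builds the hPDL text with a StringBuilder instead of the JAXB classes mentioned above. The class name, node names, schema version 0.2 and the ${jobTracker}/${nameNode} parameters are assumptions for illustration.

```java
/** Hypothetical generator for an hPDL workflow with a calculated number of fork branches. */
final class ForkWorkflowGenerator {

    private ForkWorkflowGenerator() {
    }

    /**
     * @param appName   workflow application name
     * @param mainClass driver class run by every branch (receives the branch index as argument)
     * @param branches  number of concurrent branches; a fork needs at least two paths
     */
    static String generate(String appName, String mainClass, int branches) {
        if (branches < 2) {
            throw new IllegalArgumentException("a fork needs at least two paths");
        }
        StringBuilder xml = new StringBuilder();
        xml.append("<workflow-app name=\"").append(appName)
           .append("\" xmlns=\"uri:oozie:workflow:0.2\">\n");
        xml.append("  <start to=\"fork-node\"/>\n");
        // One <path> per calculated branch
        xml.append("  <fork name=\"fork-node\">\n");
        for (int i = 0; i < branches; i++) {
            xml.append("    <path start=\"branch-").append(i).append("\"/>\n");
        }
        xml.append("  </fork>\n");
        // One java action per branch, all meeting at the same join
        for (int i = 0; i < branches; i++) {
            xml.append("  <action name=\"branch-").append(i).append("\">\n")
               .append("    <java>\n")
               .append("      <job-tracker>${jobTracker}</job-tracker>\n")
               .append("      <name-node>${nameNode}</name-node>\n")
               .append("      <main-class>").append(mainClass).append("</main-class>\n")
               .append("      <arg>").append(i).append("</arg>\n")
               .append("    </java>\n")
               .append("    <ok to=\"join-node\"/>\n")
               .append("    <error to=\"fail\"/>\n")
               .append("  </action>\n");
        }
        xml.append("  <join name=\"join-node\" to=\"end\"/>\n");
        xml.append("  <kill name=\"fail\">\n")
           .append("    <message>Branch failed</message>\n")
           .append("  </kill>\n");
        xml.append("  <end name=\"end\"/>\n");
        xml.append("</workflow-app>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        // com.example.BranchDriver is a placeholder class name
        System.out.println(generate("generated-fork", "com.example.BranchDriver", 3));
    }
}
```

The same approach covers predefined loops: instead of parallel paths, the generator emits the repeated action N times and chains each action's ok transition to the next one.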
-
Oozie client security (or lack of)
By default the Oozie client reads the client's identity from the local machine OS and passes it to the Oozie server, which uses this identity for MR job invocation.
Impersonation can be implemented by overriding the OozieClient class method createConfiguration, where client variables can be set through a new constructor.
public Properties createConfiguration() {
    Properties conf = new Properties();
    // Fall back to the OS user when no user was supplied through the constructor
    if (user == null)
        conf.setProperty(USER_NAME, System.getProperty("user.name"));
    else
        conf.setProperty(USER_NAME, user);
    return conf;
}
-
Uber jars with Oozie
An uber jar contains resources: other jars, .so libraries, zip files.
[Diagram: the Oozie server starts a launcher java action, configured with ${wfUberLauncher} and -appStart=${wfAppMain}. The uber jar holds the launcher classes together with the jars, .so libraries and zip files. The launcher performs these steps:
  unpack resources to the current uber jar dir
  set inverse classloader
  invoke MR driver
  pass arguments
  set shutdown hook
  wait for completion
The MR driver then runs the mappers.]