Everything You want to know about oozie

  • 7/22/2019 Everything You want to know about oozie

    1/31

    Everything that you ever wanted to know about Oozie, but were afraid to ask

    B Lublinsky, A Yakubovich

  • 7/22/2019 Everything You want to know about oozie

    2/31

    Apache Oozie

    Oozie is a workflow/coordination system to manage Apache Hadoop jobs.

    A single Oozie server implements all four functional Oozie components:

    - Oozie workflow
    - Oozie coordinator
    - Oozie bundle
    - Oozie SLA

  • 7/22/2019 Everything You want to know about oozie

    3/31

    Main components

    [Figure: Oozie architecture. Labels recoverable from the diagram: Oozie Command Line Interface and 3rd party application (job submission and monitoring through the WS API); Oozie Server containing Bundles, Coordinators (time condition monitoring, data condition monitoring) and Workflows (wf logic, actions); Data (definitions, states); Oozie shared libraries; Hadoop (HDFS, MapReduce).]

  • 7/22/2019 Everything You want to know about oozie

    4/31

    Oozie workflow

  • 7/22/2019 Everything You want to know about oozie

    5/31

    Workflow Language

    Flow-control node | XML element type | Description
    Decision | workflow:DECISION | expresses switch-case logic
    Fork | workflow:FORK | splits one path of execution into multiple concurrent paths
    Join | workflow:JOIN | waits until every concurrent execution path of a previous fork node arrives at it
    Kill | workflow:kill | forces a workflow job to kill (abort) itself

    Action node | XML element type | Description
    java | workflow:JAVA | invokes the main() method of the specified Java class
    fs | workflow:FS | manipulates files and directories in HDFS; supports commands: move, delete, mkdir
    MapReduce | workflow:MAP-REDUCE | starts a Hadoop map/reduce job; that could be a Java MR job, a streaming job or a pipes job
    Pig | workflow:pig | runs a Pig job
    Sub workflow | workflow:SUB-WORKFLOW | runs a child workflow job
    Hive * | workflow:HIVE | runs a Hive job
    Shell * | workflow:SHELL | runs a shell command
    ssh * | workflow:SSH | starts a shell command on a remote machine as a remote secure shell
    Sqoop * | workflow:SQOOP | runs a Sqoop job
    Email * | workflow:EMAIL | sends emails from an Oozie workflow application
    Distcp ? | | Under development (Yahoo)

  • 7/22/2019 Everything You want to know about oozie

    6/31

    Workflow actions

    [Sequence diagram. Participants: ActionStartCommand, JavaActionExecutor, WorkflowStore, Services, JobClient, ActionExecutorContext]

    1 : workflow := getWorkflow()

    2 : action := getAction()

    3 : context := init()

    4 : executor := get()

    5 : start()

    6 : submitLauncher()

    7 : jobClient := get()

    8 : runningJob := submit()

    9 : setStartData()

    Oozie workflow supports two types of actions:

    - Synchronous, executed inside the Oozie runtime
    - Asynchronous, executed as a Map Reduce job

  • 7/22/2019 Everything You want to know about oozie

    7/31

    Workflow lifecycle

    [Figure: workflow state diagram. States: PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED]

  • 7/22/2019 Everything You want to know about oozie

    8/31

    Oozie execution console

  • 7/22/2019 Everything You want to know about oozie

    9/31

    Extending Oozie workflow

    Oozie provides a minimal workflow language, which contains only a handful of control and action nodes.

    Oozie supports a very elegant extensibility mechanism: custom action nodes. Custom action nodes allow extending the Oozie language with additional actions (verbs).

    Creating a custom action requires the following:

    - A Java action implementation, which extends the ActionExecutor class
    - An XML schema for the action, defining the action's configuration parameters
    - Packaging of the Java implementation and the configuration schema into an action jar, which has to be added to the Oozie war
    - Extending oozie-site.xml to register information about the custom executor with the Oozie runtime

  • 7/22/2019 Everything You want to know about oozie

    10/31

    Oozie Workflow Client

    Oozie provides an easy way to integrate with enterprise applications through Oozie client APIs. It provides two types of APIs:

    REST HTTP API

    - A number of HTTP requests
    - Info requests (job status, job configuration)
    - Job management (submit, start, suspend, resume, kill)
    - Example: job definition info request

      GET /oozie/v0/job/job-ID?show=definition

    Java API - package org.apache.oozie.client

    - OozieClient: start(), submit(), run(), reRunXXX(), resume(), kill(), suspend()
    - WorkflowJob, WorkflowAction
    - CoordinatorJob, CoordinatorAction
    - SLAEvent
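The info request shown on this slide can be prepared from Java with the JDK's own HTTP classes. This is a minimal sketch, not part of the original deck: the class name `OozieRestSketch`, the server address `http://localhost:11000` and the sample job ID are assumptions for illustration, and the request is only built, not sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Oozie client library.
final class OozieRestSketch {

    // Builds the URI for an info request such as
    // GET /oozie/v0/job/<job-ID>?show=definition
    static URI jobInfoUri(String serverBase, String jobId, String show) {
        String id = URLEncoder.encode(jobId, StandardCharsets.UTF_8);
        String what = URLEncoder.encode(show, StandardCharsets.UTF_8);
        return URI.create(serverBase + "/oozie/v0/job/" + id + "?show=" + what);
    }

    // Prepares (but does not send) the GET request for that URI.
    static HttpRequest jobInfoRequest(String serverBase, String jobId, String show) {
        return HttpRequest.newBuilder(jobInfoUri(serverBase, jobId, show)).GET().build();
    }
}
```

To actually send the request, it could be passed to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` against a running Oozie server.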

  • 7/22/2019 Everything You want to know about oozie

    11/31

    Oozie workflow: good, bad and ugly

    Good

    - Nice integration with the Hadoop ecosystem, making it easy to build processes encompassing synchronized execution of multiple MapReduce, Hive, Pig, etc. jobs
    - Nice UI for tracking execution progress
    - Simple APIs for integration with other applications
    - Simple extensibility APIs

    Bad

    - Process has to be expressed directly in hPDL with no visual support
    - No support for uber jars (but we added our own)

    Ugly

    - Static forking (but you can regenerate the workflow and invoke it on the fly)
    - No support for loops

  • 7/22/2019 Everything You want to know about oozie

    12/31

    Oozie Coordinator

  • 7/22/2019 Everything You want to know about oozie

    13/31

    Coordinator language

    Element type | Description | Attributes and sub-elements
    coordinator-app | top-level element in a coordinator instance | frequency, start, end
    controls | specify the execution policy for the coordinator and its elements (workflow actions) | timeout (actions), concurrency (actions), execution order (workflow instances)
    action | required singular element specifying the associated workflow; the jobs specified in the workflow consume and produce dataset instances | workflow name
    datasets | collection of data referred to by a logical name; datasets serve to specify data dependencies between workflow instances |
    input event | specifies the input conditions (in the form of present datasets) that are required in order to execute a coordinator action |
    output event | specifies the dataset that should be produced by a coordinator action |

  • 7/22/2019 Everything You want to know about oozie

    14/31

    Coordinator lifecycle

  • 7/22/2019 Everything You want to know about oozie

    15/31

    Oozie Bundle

  • 7/22/2019 Everything You want to know about oozie

    16/31

    Bundle lifecycle

    [Figure: bundle state diagram. States: PREP, PREPSUSPENDED, PREPPAUSED, RUNNING, SUSPENDED, PAUSED, SUCCEEDED, KILLED, FAILED]

  • 7/22/2019 Everything You want to know about oozie

    17/31

    Oozie SLA

  • 7/22/2019 Everything You want to know about oozie

    18/31

    SLA Navigation

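    [Figure: database tables used for SLA navigation, with the columns shown in the diagram]

    SLA_EVENT: event_id, alert_contact, alert_frequency, sla_id, ...
    COORD_JOBS: id, app_name, app_path
    COORD_ACTIONS: id, action_number, action_xml, external_id, ...
    WF_JOBS: id, app_name, app_path
    WF_ACTIONS: id, conf, console_url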

  • 7/22/2019 Everything You want to know about oozie

    19/31

  • 7/22/2019 Everything You want to know about oozie

    20/31

    Using Probes to analyze/monitor Places

    Select probe data for specified time/location

    Validate - Filter - Transform probe data

    Calculate statistics on available probe data

    Distribute data per geo-tiles

    Calculate place statistics (e.g. attendance index)

    -------------------------------------------------------------

    If exception condition happens, report failure

    If all steps succeeded, report success


  • 7/22/2019 Everything You want to know about oozie

    21/31

    Workflow as acyclic graph


  • 7/22/2019 Everything You want to know about oozie

    22/31

    Workflow fragment 1


  • 7/22/2019 Everything You want to know about oozie

    23/31

    Workflow fragment 2

  • 7/22/2019 Everything You want to know about oozie

    24/31

    Oozie tips and tricks

  • 7/22/2019 Everything You want to know about oozie

    25/31

    Configuring workflow

    Oozie provides 3 overlapping mechanisms to configure a workflow: config-default.xml, the job properties file, and job arguments that can be passed to Oozie as part of command line invocations.

    Oozie processes these three sets of parameters as follows:

    - Use all of the parameters from the command line invocation
    - For remaining unresolved parameters, use the job properties
    - Use config-default.xml for everything else

    Although the documentation does not clearly describe when to use which, the overall recommendation is as follows:

    - Use config-default.xml for parameters that never change for a given workflow
    - Use job properties for parameters that are common to a given deployment of a workflow
    - Use command line arguments for parameters that are specific to a given workflow invocation
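The lookup order described on this slide can be modelled with chained `java.util.Properties` defaults. This is only an illustrative sketch of the precedence, not Oozie's own implementation; the class name `WorkflowConfigSketch` and the method `resolve` are made up for the example.

```java
import java.util.Properties;

// Hypothetical model of the precedence described above; not Oozie code.
final class WorkflowConfigSketch {

    // Layers the three sources so that a lookup falls through in this order:
    // command line -> job properties -> config-default.xml.
    static Properties resolve(Properties configDefault,
                              Properties jobProperties,
                              Properties commandLine) {
        // Lowest priority: values from config-default.xml.
        Properties defaults = new Properties();
        defaults.putAll(configDefault);

        // Middle priority: job properties, falling back to the defaults.
        Properties job = new Properties(defaults);
        job.putAll(jobProperties);

        // Highest priority: command line arguments, falling back to the job layer.
        Properties effective = new Properties(job);
        effective.putAll(commandLine);
        return effective;
    }
}
```

With this layering, `getProperty` returns the command line value when one exists, then the job properties value, and only then the value from config-default.xml.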

  • 7/22/2019 Everything You want to know about oozie

    26/31

    Accessing and storing process variables

    Accessing

    Through the arguments in java main

    Storing

    // Oozie passes the name of the output file through this system property
    String ooziePropFileName = System.getProperty("oozie.action.output.properties");

    Properties props = new Properties();
    props.setProperty(key, value);

    // try-with-resources closes the stream even if store() throws
    try (OutputStream os = new FileOutputStream(new File(ooziePropFileName))) {
        props.store(os, "");
    }
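Put together, a java action that both reads a process variable and stores one might look like the sketch below. The class name `StoreVariablesAction` and the key `processedDir` are hypothetical. As far as we know, Oozie only reads the output file back when the java action declares `<capture-output/>`, after which the value should be reachable through the `wf:actionData` EL function; treat both as assumptions to check against the Oozie version in use.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

// Hypothetical java action main class (declare it public in a real action).
final class StoreVariablesAction {

    public static void main(String[] args) throws IOException {
        // Accessing: workflow parameters arrive as ordinary main() arguments.
        String inputDir = args.length > 0 ? args[0] : "";

        // Storing: Oozie names the output file through this system property.
        String ooziePropFileName = System.getProperty("oozie.action.output.properties");
        if (ooziePropFileName == null) {
            throw new IllegalStateException("oozie.action.output.properties is not set");
        }

        Properties props = new Properties();
        props.setProperty("processedDir", inputDir); // hypothetical key

        // try-with-resources closes the stream even if store() throws.
        try (OutputStream os = new FileOutputStream(new File(ooziePropFileName))) {
            props.store(os, "");
        }
    }
}
```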

  • 7/22/2019 Everything You want to know about oozie

    27/31

    Validating data presence

    Oozie provides two possible approaches for validating the presence of resource file(s):

    - Using Oozie coordinator input events based on the dataset. Technically this is the simplest implementation approach, but it does not provide the more complex decision support that might be required. It just either runs the corresponding workflow or not.
    - Using a custom java node inside the Oozie workflow. This allows extending the decision logic by sending notifications about data absence, running execution on partial data under certain timing conditions, etc.

    Additional configuration parameters for the Oozie coordinator (for example, the ability to wait for files to arrive) can expand the usage of the Oozie coordinator.
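The kind of decision logic a custom java node could add on top of a plain presence check might look like the following sketch. Everything in it is hypothetical: the class `DataPresenceCheck`, the `Decision` values and the rule "run on partial data once the wait time is exhausted" are only one possible policy, and the actual file counting (for example against HDFS) is left out.

```java
import java.time.Duration;

// Hypothetical decision logic for a custom data-validation java node.
final class DataPresenceCheck {

    enum Decision { RUN, RUN_PARTIAL, WAIT, NOTIFY_ABSENT }

    // Decides what to do given how many expected files are present
    // and how long the workflow has already been waiting for them.
    static Decision decide(int filesPresent, int filesExpected,
                           Duration waited, Duration maxWait) {
        if (filesPresent >= filesExpected) {
            return Decision.RUN;                 // all data arrived
        }
        boolean timedOut = waited.compareTo(maxWait) >= 0;
        if (!timedOut) {
            return Decision.WAIT;                // keep waiting for the rest
        }
        // Out of time: run on what is there, or report that nothing arrived.
        return filesPresent > 0 ? Decision.RUN_PARTIAL : Decision.NOTIFY_ABSENT;
    }
}
```

The result could then be written to the action output and evaluated by a decision node in the workflow.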

  • 7/22/2019 Everything You want to know about oozie

    28/31

    Invoking Map Reduce jobs

    Oozie provides two different ways of invoking a Map Reduce job: the MapReduce action and the java action.

    Invoking a Map Reduce job with the java action is somewhat similar to invoking this job with the Hadoop command line from the edge node. You specify a driver as the class for the java activity and Oozie invokes the driver. This approach has two main advantages:

    - The same driver class can be used both for running the Map Reduce job from an edge node and as a java action in an Oozie process.
    - A driver provides a convenient place for executing additional code, for example clean-up required for Map Reduce execution.

    The driver requires a proper shutdown hook to ensure that there are no lingering Map Reduce jobs.
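A shutdown hook of the kind mentioned above could be sketched as follows. The `JobHandle` interface is a hypothetical stand-in for whatever handle the driver holds on the submitted job; with Hadoop's `org.apache.hadoop.mapreduce.Job` the corresponding calls would presumably be `isComplete()` and `killJob()`.

```java
// Hypothetical sketch of a driver-side shutdown hook; not taken from the deck.
final class DriverShutdownSketch {

    // Stand-in for a handle to the submitted Map Reduce job.
    interface JobHandle {
        boolean isComplete();
        void kill();
    }

    // Builds the hook thread: if the JVM goes down while the job is
    // still running, kill the job so it does not linger on the cluster.
    static Thread shutdownHook(JobHandle job) {
        return new Thread(() -> {
            if (!job.isComplete()) {
                job.kill();
            }
        });
    }

    // Registers the hook with the JVM; call this right after job submission.
    static void register(JobHandle job) {
        Runtime.getRuntime().addShutdownHook(shutdownHook(job));
    }
}
```

The driver would call `register(job)` once the job is submitted and then wait for completion as usual.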

  • 7/22/2019 Everything You want to know about oozie

    29/31

    Implementing predefined looping and forking

    - hPDL is an XML document with a well-defined schema.
    - This means that the actual workflow can be easily manipulated using JAXB objects, which can be generated from the hPDL schema using the xjc compiler.
    - This means that we can create the complete workflow programmatically, based on a calculated number of fork branches, or implement loops as repeated actions.
    - The other option is to create a template process and modify it based on calculated parameters.
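To illustrate the idea of generating a workflow from a calculated number of fork branches, here is a minimal sketch. It deliberately uses plain string building instead of the JAXB objects mentioned on the slide, so that it has no dependencies. The class name, node names, the `${mainClass}` parameter and the schema version `uri:oozie:workflow:0.2` are assumptions for the example.

```java
// Hypothetical generator for a fork/join workflow; string-based, not JAXB.
final class ForkWorkflowGenerator {

    // Produces hPDL with one java action per branch between a fork and a join.
    // Note: the workflow name is inserted as-is and is not XML-escaped.
    static String generate(String name, int branches) {
        if (branches < 2) {
            throw new IllegalArgumentException("a fork needs at least two branches");
        }
        StringBuilder xml = new StringBuilder();
        xml.append("<workflow-app xmlns=\"uri:oozie:workflow:0.2\" name=\"")
           .append(name).append("\">\n");
        xml.append("  <start to=\"fork-1\"/>\n");
        xml.append("  <fork name=\"fork-1\">\n");
        for (int i = 0; i < branches; i++) {
            xml.append("    <path start=\"branch-").append(i).append("\"/>\n");
        }
        xml.append("  </fork>\n");
        for (int i = 0; i < branches; i++) {
            xml.append("  <action name=\"branch-").append(i).append("\">\n");
            xml.append("    <java>\n");
            xml.append("      <job-tracker>${jobTracker}</job-tracker>\n");
            xml.append("      <name-node>${nameNode}</name-node>\n");
            xml.append("      <main-class>${mainClass}</main-class>\n");
            xml.append("      <arg>").append(i).append("</arg>\n");
            xml.append("    </java>\n");
            xml.append("    <ok to=\"join-1\"/>\n");
            xml.append("    <error to=\"fail\"/>\n");
            xml.append("  </action>\n");
        }
        xml.append("  <join name=\"join-1\" to=\"end\"/>\n");
        xml.append("  <kill name=\"fail\"><message>Branch failed</message></kill>\n");
        xml.append("  <end name=\"end\"/>\n");
        xml.append("</workflow-app>\n");
        return xml.toString();
    }
}
```

A predefined loop could be generated the same way, by emitting the same action N times in sequence, each with its `ok` transition pointing at the next copy.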

  • 7/22/2019 Everything You want to know about oozie

    30/31

    Oozie client security (or lack of)

    By default the Oozie client reads the client's identity from the local machine OS and passes it to the Oozie server, which uses this identity for MR job invocation.

    Impersonation can be implemented by overriding the OozieClient class method createConfiguration, where client variables can be set through a new constructor.

    public Properties createConfiguration() {
        Properties conf = new Properties();
        // Fall back to the OS user when no explicit user was supplied
        if (user == null) {
            conf.setProperty(USER_NAME, System.getProperty("user.name"));
        } else {
            conf.setProperty(USER_NAME, user);
        }
        return conf;
    }

  • 7/22/2019 Everything You want to know about oozie

    31/31

    Uber jars with Oozie

    An uber jar contains resources: other jars, so libraries, zip files.

    [Figure: uber jar launch flow. Labels recoverable from the diagram: Oozie server; java action with ${wfUberLauncher} and -appStart=${wfAppMain}; launcher; uber jar containing Classes (Launcher), jars, so, zip; mapper, mapper.]

    Launcher steps shown in the diagram:

    - unpack resources to current uber jar dir
    - set inverse classloader
    - invoke MR driver
    - pass arguments
    - set shutdown hook
    - wait for complete