5 IBM Presentation 268018 7

download 5 IBM Presentation 268018 7

of 43

Transcript of 5 IBM Presentation 268018 7

  • 7/28/2019 5 IBM Presentation 268018 7

    1/43

    2008 IBM Corporation1

    Three courses of DataStage, with a sideorder of TeradataStewart HannaProduct Manager

  • 7/28/2019 5 IBM Presentation 268018 7

    2/43

    2

    Information Management Software

    Agenda

    Platform Overview

    Appetizer - Productivity First Course Extensibility

    Second Course Scalability

    Desert Teradata

  • 7/28/2019 5 IBM Presentation 268018 7

    3/43

    3

    Information Management Software

    Agenda

    Platform Overview

    Appetizer - Productivity First Course Extensibility

    Second Course Scalability

    Desert Teradata

  • 7/28/2019 5 IBM Presentation 268018 7

    4/434

    Information Management Software

    Business

    Value

    Maturity of

    Information Use

    Accelerate to the Next Level

    Unlocking the Business Value of Information for Competitive Advantage

  • 7/28/2019 5 IBM Presentation 268018 7

    5/435

    Information Management Software

    An Industry Unique Information Platform

    Simplify the delivery of Trusted Information

    Accelerate client value

    Promote collaboration

    Mitigate risk

    Modular but Integrated

    Scalable Project to Enterprise

    The IBM InfoSphere Vision

  • 7/28/2019 5 IBM Presentation 268018 7

    6/436

    Information Management Software

    IBM InfoSphere Information Server

    Delivering information you can trust

  • 7/28/2019 5 IBM Presentation 268018 7

    7/437

    Information Management Software

    InfoSphere DataStage

    Provides codeless visual design ofdata flows with hundreds of buil t-intransformation functions

    Optimized reuse of integrationobjects

    Supports batch & real-timeoperations

    Produces reusable components thatcan be shared across projects

    Complete ETL functionality withmetadata-driven productivity

    Supports team-based developmentand collaboration

    Provides integration from across thebroadest range of sources

    Transform

    Transform and aggregate any volume

    of information in batch or real timethrough visually designed logic

    Hundreds of Built-in

    Transformation Functions

    ArchitectsDevelopers

    InfoSphere DataStage

    Deliver

  • 7/28/2019 5 IBM Presentation 268018 7

    8/438

    Information Management Software

    Agenda

    Platform Overview

    Appetizer - Productivity First Course Extensibility

    Second Course Scalability

    Desert Teradata

  • 7/28/2019 5 IBM Presentation 268018 7

    9/43

    9

    Information Management Software

    DataStage Designer

    Completedevelopment

    environment Graphical, drag and

    drop top down design

    metaphor

    Develop sequentially,deploy in parallel

    Component-basedarchitecture

    Reuse capabilities

  • 7/28/2019 5 IBM Presentation 268018 7

    10/43

    10

    Information Management Software

    Transformation Components

    Over 60 pre-built componentsavailable including

    Files

    Database

    Lookup

    Sort, Aggregation,

    Transformer

    Pivot, CDC

    J oin, Merge

    Filter, Funnel, Switch, Modify

    Remove Duplicates

    Restructure stages

    Sample ofStages

    Avai lable

  • 7/28/2019 5 IBM Presentation 268018 7

    11/43

    11

    Information Management Software

    Some Popular Stages

    Usual ETL Sources & Targets:- RDBMS, Sequential File, Data Set

    Combining Data:- Lookup, Joins, Merge

    - Aggregator

    Transform Data:- Transformer, Remove Duplicates

    Ancillary:- Row Generator, Peek, Sort

  • 7/28/2019 5 IBM Presentation 268018 7

    12/43

    12

    Information Management Software

    Deployment

    Easy and integrated jobmovement from one environmentto the other

    A deployment package can becreated with any first class objectsfrom the repository. New objects

    can be added to an existingpackage.

    The description of this package isstored in the metadata repository.

    All associated objects can beadded in like files outside of DSand QS like scripts.

    Audit and Security Control

  • 7/28/2019 5 IBM Presentation 268018 7

    13/43

    13

    Information Management Software

    Role-Based Tools with Integrated Metadata

    Simpli fy Integration Increase trust andconfidence in information

    Increase compliance tostandards

    Facil itate changemanagement & reuseDesign Operational

    DevelopersSubject MatterExperts

    DataAnalysts

    BusinessUsers

    Architects DBAs

    Unified Metadata Management

  • 7/28/2019 5 IBM Presentation 268018 7

    14/43

    14

    Information Management Software

    Business analysts and ITcollaborate in context tocreate project specification

    Leverages sourceanalysis, target models,and metadata to facilitatemapping process

    Auto-generation ofdata transformationjobs & reports

    Generate historicaldocumentation for tracking

    Supports data governance

    Flexible Reporting

    Specification

    Auto-generatesDataStage jobs

    InfoSphere FastTrackTo reduce Costs of Integration Projects through Automation

  • 7/28/2019 5 IBM Presentation 268018 7

    15/43

    15

    Information Management Software

    Agenda Platform Overview

    Appetizer - Productivity First Course Extensibility

    Second Course Scalability

    Desert Teradata

  • 7/28/2019 5 IBM Presentation 268018 7

    16/43

    16

    Information Management Software

    Extending DataStage - Defining Your Own Stage Types

    Define your own stage type to beintegrated into data flow

    Stage Types

    Wrapped Specify a OS command or

    script

    Existing Routines, Logic, Apps

    BuildOp Wizard / Macro driven

    Development

    Custom

    API Development

    Available to all jobs in the project

    All meta data is captured

  • 7/28/2019 5 IBM Presentation 268018 7

    17/43

    17

    Information Management Software

    Building WrappedStages

    In a nutshell:

    You can wrap a legacy executable:

    binary, Unix command,

    shell script

    and turn it into a bona fide DataStage stage capable, amongother things, of parallel execution,

    as long as the legacy executable is

    amenable to data-partition parallelism

    no dependencies between rows

    pipe-safe

    can read rows sequentially

    no random access to data, e.g., use of fseek()

  • 7/28/2019 5 IBM Presentation 268018 7

    18/43

    18

    Information Management Software

    Building BuildOp Stages In a nutshell:

    The user performs the fun, glamorous tasks:encapsulate business logic in a custom operator

    The DataStage wizard called buildop automaticallyperforms the unglamorous, tedious, error-prone tasks:

    invoke needed header files, build the necessaryplumbing for a correct and efficient parallel

    execution.

  • 7/28/2019 5 IBM Presentation 268018 7

    19/43

    19

    Information Management Software

    Layout interfaces describe what columns the stage:

    needs for its inputs

    creates for its outputs

    Two kinds of interfaces: dynamic and static

    Dynamic: adjusts to itsinputs automatically

    Custom Stages can beDynamic or Static

    IBM DataStage suppliedStages are dynamic

    Static: expects input to

    contain columns withspecific names and types

    BuildOp are only Static

    Major difference between BuildOp and Custom Stages

    I f ti M t S ft

  • 7/28/2019 5 IBM Presentation 268018 7

    20/43

    20

    Information Management Software

    Agenda Platform Overview

    Appetizer - Productivity

    First Course Extensibility

    Second Course Scalability

    Desert Teradata

    I f ti M t S ft

  • 7/28/2019 5 IBM Presentation 268018 7

    21/43

    21

    Information Management Software

    Scalability is important everywhere

    It take me 4 1/2 hours to wash, dry and fold 3 loads oflaundry ( hour for each operation)

    Wash Dry Fold

    Sequential Approach (4 Hours)

    Wash a load, dry the load, fold it

    Wash-dry-fold

    Wash-dry-fold

    Wash Dry FoldPipeline Approach (2 Hours)

    Wash a load, when it is done, put it in the dryer, etc

    Load washing while another load is drying, etc

    Wash Dry FoldWash Dry Fold

    Partitioned Parallelism Approach (1 Hours)

    Divide the laundry into different loads (whites, darks,linens)

    Work on each piece independently

    3 Times faster with 3 Washing machines, continue with 3dryers and so on

    Wash Dry FoldPartitio

    nin

    g

    Repartitio

    nin

    g

    Whites, Darks, Linens

    WorkclothesDelicates, Lingerie, Outside

    LinensShirts on Hangers, pants

    folded

    Partitio

    nin

    g

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    22/43

    Information Management Software

    SourceData

    Processor 1

    Processor 2

    Processor 3

    Processor 4

    A-F

    G- M

    N-T

    U-Z

    Break up big data into partitionsRun one partition on each processor4X times faster on 4 processors; 100X faster on 100 processors

    Partitioning is specified per stage meaning partitioning can changebetween stages

    Transform

    Transform

    Transform

    Transform

    Data Partitioning

    Typeso

    f

    Partitioning

    options

    available

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    23/43

    Information Management Software

    Data Pipelining

    Source Target

    Archived Data

    1,000 recordsfirst chunk

    1,000 recordssecond chunk

    1,000 recordsthird chunk

    1,000 recordsfourth chunk

    1,000 recordsfifth chunk

    1,000 recordssixth chunk

    Records 9,001to 100,000,000

    1,000 recordsseventh chunk

    1,000 recordseighth chunk

    Load

    Eliminate the write to disk and the read from diskbetween processes

    Start a downstream process while an upstreamprocess is still running.

    This eliminates intermediate staging to disk, which iscritical for big data.

    This also keeps the processors busy.

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    24/43

    Information Management Software

    TargetSource

    Partiti

    onin

    g

    Transform

    Parallel Dataflow

    Parallel Processing achieved in a data flow

    Still limiting

    Partitioning remains constant throughout flow

    Not realistic for any realjobs

    For example, what if transformations are based on customer idand enrichment is a house holding task (i.e., based on post code)

    Aggregate Load

    SourceData

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    25/43

    Information Management Software

    TargetSource

    Pipelining

    SourceData

    Partitio

    nin

    g

    A-F

    G- MN-T

    U-Z

    Customer last name Customer Postcode Credit card number

    LoadRep

    artitio

    ning

    Rep

    artitio

    nin

    g

    Transform Aggregate

    Parallel Data Flow with Auto Repartitioning

    Record repartitioning occurs automatically

    No need to repartition data as you

    add processors

    change hardware architecture

    Broad range of partitioning methods

    Entire, hash, modulus, random, round robin, same, DB2 range

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    26/43

    26

    Information Management Software

    Partitio

    nin

    g

    DataWarehouse

    SourceData A-F

    G- MN-T

    U-Z

    Repartiti

    oning

    Repartitio

    nin

    g

    Sequential

    4-way parallel

    128-way parallel

    32 4-way nodes

    ....

    ....

    ....

    Appl icat ion Execution: Sequential or Parallel

    Partitio

    ning

    Target

    DataWarehouse

    Source

    SourceData A-F

    G- MN-T

    U-Z

    Customer last name Custzip code Credit card number

    Repartiti

    oning

    Repartitio

    ning

    Target

    SourceData

    DataWarehouse

    Source

    Target

    SourceData

    DataWarehouse

    Source

    Partitio

    nin

    g

    DataWarehouse

    SourceData A-F

    G- M

    N-TU-Z

    Repartition

    ing

    Repartition

    ing

    Application Assembly: One Dataf low Graph Created With the DataStage GUI

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    27/43

    27

    Information Management Software

    Data Set stage: allows you to read data from or write data to a data set.Parallel Extender data sets hide the complexities of handling and storing largecollections of records in parallel across the disks of a parallel computer.

    File Set stage: allows you to read data from or write data to afile set. It only executes in parallel mode.

    Sequential File stage: reads data from or writes data to one or moreflat files. It usually executes in parallel, but can be configured to executesequentially.

    External Source stage: allows you to read data that isoutput from one or more source programs.

    External Target stage: allows you to write data to one or moretarget programs.

    Robust mechanisms for handling big data

  • 7/28/2019 5 IBM Presentation 268018 7

    28/43

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    29/43

    g

    DataStage SAS Stages

    Parallel SAS Data Set stage:

    Allows you to read data from or write data to a parallel SAS data set inconjunction with a SAS stage.

    A parallel SAS Data Set is a set of one or more sequential SAS datasets, with a header file specifying the names and locations of all of the

    component files.

    SAS stage:

    Executes part or all of a SAS application in parallel by processing

    parallel streams of data with parallel instances of SAS DATA andPROC steps.

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    30/43

    30

    The Configuration File{

    node "n1" {f ast name "s1"pool " " "n1" "s1" "app2" "sort "r esour ce di sk "/ or ch/ n1/ d1" {}r esour ce di sk "/ or ch/ n1/ d2" {"bi gdat a"}r esour ce scrat chdi sk "/ t emp" {"sor t "}

    }node "n2" {

    f ast name "s2"pool " " "n2" "s2" "app1"r esour ce di sk "/ or ch/ n2/ d1" {}r esour ce di sk "/ or ch/ n2/ d2" {"bi gdat a"}r esour ce scr at chdi sk " / t emp" {}

    }node "n3" {

    f ast name "s3"pool " " "n3" "s3" "app1"r esour ce di sk "/ or ch/ n3/ d1" {}r esour ce scr at chdi sk " / t emp" {}

    }node "n4" {

    f ast name "s4"pool " " "n4" "s4" "app1"r esour ce di sk "/ or ch/ n4/ d1" {}r esour ce scr at chdi sk " / t emp" {}

    }

    1

    43

    2

    Two key aspects:

    1. # nodes declared

    2. defining subset of resources"pool" for execution under"constraints," i.e., using asubset of resources

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    31/43

    31

    Dynamic GRID Capabilities GRID Tab to help manage execution across a grid

    Automatically reconfigures parallelism to fit GRID

    resources

    Managed through an external grid resource manager

    Available for Red Hat and Tivoli Work Scheduler Loadleveler

    Locks resources at execution t ime to ensure SLAs

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    32/43

    32

    J ob Monitoring & Logging Job Performance Analysis

    Detail job monitoring

    information available duringand after job execution

    Start and elapsed times

    Record counts per link

    % CPU used by each process

    Data skew across partitions

    Also available from

    command line dsj ob r epor t

    [ ]

    type =BASIC, DETAIL, XML

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    33/43

    33

    Provides automatic optimizationof data flows mappingtransformation logic to SQL

    Leverages investments in DBMShardware by executing data

    integration tasks with and withinthe DBMS

    Optimizes job run-time by

    allowing the developer to controlwhere the job or various parts ofthe job will execute.

    Transform and aggregate any volume

    of information in batch or real time

    through visually designed logic

    ArchitectsDevelopers

    Optimizing run time through intelligent

    use of DBMS hardware

    InfoSphere DataStage Balanced OptimizationData Transformation

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    34/43

    34

    Using Balanced Optimization

    originalDataStagejob

    c designjob

    DataStageDesigner

    job

    results

    d compile& run e verify

    rewrittenoptimized

    job

    f optimizejob

    Balanced Optimization

    g compile& run

    h choose different opt ionsand reoptimize

    i manually review/edit optimized job

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    35/43

    35

    Leveraging best-of-breed systems

    Optimization is not constrained to a singleimplementation style such as ETL or ELT

    InfoSphere DataStage BalancedOptimization fully harnesses availablecapacity and computing power in Teradata

    and DataStage

    Delivering unlimited scalability and

    performance through parallel executioneverywhere, all the time

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    36/43

    36

    Agenda Platform Overview

    Appetizer - Productivity

    First Course Extensibility

    Second Course Scalability

    Desert Teradata

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    37/43

    37

    InfoSphere Rich Connectivity

    Rich

    Connectivity

    Shared, easy to

    use connectivity

    infrastructure

    Event-driven,

    real-time, and

    batch

    connectivity

    High volume,

    parallel connectivity

    to databases and

    file systems

    Best-of-breed,

    metadata-driven

    connectivity to

    enterprise applications

    Frictionless Connectivity

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    38/43

    38

    Teradata Connectivity

    7+ highly optimized interfaces forTeradata leveraging Teradata Toolsand Utilities: FastLoad, FastExport,

    TPUMP, MultiLoad, TeradataParallel Transport, CLI and ODBC

    You use the best interfacesspecifically designed for your

    integration requirements:

    Parallel/bulk extracts/loads withvarious/optimized data partitioning

    optionsTable maintenance

    Real-time/transactional trickle

    feeds without table locking

    FastLoad

    FastExport

    MulitLoad

    TPUMP

    TPT

    CLI

    ODBC

    TeradataEDW

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    39/43

    39

    Teradata Connectivty RequestedSessions determines the total number of distr ibuted

    connections to the Teradata source or target

    When not specified, it equals the number of Teradata VPROCs

    (AMPs) Can set between 1 and number of VPROCs

    SessionsPerPlayerdetermines the number of connections each player willhave to Teradata. Indirectly, it also determines the number of players

    (degree of parallelism). Default is 2 sessions / player

    The number selected should be such thatSessionsPerPlayer * number of nodes * number of players pernode = RequestedSessions

    Setting the value ofSessionsPerPlayertoo low on a large systemcan result in so many players that the job fails due to insufficientresources. In that case, the value for -SessionsPerPlayer shouldbe increased.

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    40/43

    40

    Teradata Server

    MPP with 4 TPA nodes

    4 AMPs per TPA node

    DataStage Server

    Configuration File Sessions Per Player Total Sessions

    16 nodes 1 16

    8 nodes 1 8

    8 nodes 2 16

    4 nodes 4 16

    Example Settings

    Teradata SessionsPerPlayer Example

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    41/43

    41

    Customer Example Massive Throughput

    CSA Production Linux Grid

    Teradata

    TeradataNode

    TeradataNode

    TeradataNode

    TeradataNode

    TeradataNode

    TeradataNode

    TeradataNode

    TeradataNode

    TeradataNode

    TeradataNode

    Head Node Compute A Compute B

    TeradataNode

    TeradataNode

    Compute C Compute D Compute E

    24 Port Switch24 Port Switch

    32 GbInterconnect

    Legend

    Teradata

    Linux Grid

    Stacked Switches

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    42/43

    42

    The IBM InfoSphere Information Server Advantage

    A Complete Information Infrastructure

    A comprehensive, unified foundation for enterprise informationarchitectures, scalable to any volume and processing requirement

    Auditable data quality as a foundation for trusted informationacross the enterprise

    Metadata-driven integration, providing breakthrough product ivityand flexibility for integrating and enriching information

    Consistent, reusable information services along withapplication services and process services, an enterprise essential

    Accelerated time to value with proven, industry-aligned solut ionsand expertise

    Broadest and deepest connectivity to information across diversesources: structured, unstructured, mainframe, and applications

    Information Management Software

  • 7/28/2019 5 IBM Presentation 268018 7

    43/43

    43