5 IBM Presentation 268018 7

7/28/2019 5 IBM Presentation 268018 7

2008 IBM Corporation

Three Courses of DataStage, with a Side Order of Teradata

Stewart Hanna, Product Manager



Information Management Software

Agenda

Platform Overview

Appetizer - Productivity

First Course - Extensibility

Second Course - Scalability

Dessert - Teradata


Agenda

Platform Overview

Dessert - Teradata


Accelerate to the Next Level

Unlocking the Business Value of Information for Competitive Advantage

[Figure: business value plotted against maturity of information use]


The IBM InfoSphere Vision

An Industry Unique Information Platform

Simplify the delivery of Trusted Information

Accelerate client value

Promote collaboration

Mitigate risk

Modular but Integrated

Scalable: Project to Enterprise




IBM InfoSphere Information Server

Delivering information you can trust


InfoSphere DataStage

Provides codeless visual design of data flows with hundreds of built-in transformation functions

Optimized reuse of integration objects

Supports batch & real-time operations

Produces reusable components that can be shared across projects

Complete ETL functionality with metadata-driven productivity

Supports team-based development and collaboration

Provides integration from across the broadest range of sources

Transform: Transform and aggregate any volume of information in batch or real time through visually designed logic

Hundreds of Built-in Transformation Functions

Architects, Developers

Deliver


Agenda

Platform Overview

Dessert - Teradata


DataStage Designer

Complete development environment

Graphical, drag and drop, top down design metaphor

Develop sequentially, deploy in parallel

Component-based architecture

Reuse capabilities


Transformation Components

Over 60 pre-built components available, including

Files

Database

Lookup

Sort, Aggregation, Transformer

Pivot, CDC

Join, Merge

Filter, Funnel, Switch, Modify

Remove Duplicates

Restructure stages

Sample of Stages Available


Some Popular Stages

Usual ETL Sources & Targets: RDBMS, Sequential File, Data Set

Combining Data: Lookup, Joins, Merge, Aggregator

Transform Data: Transformer, Remove Duplicates

Ancillary: Row Generator, Peek, Sort


Deployment

Easy and integrated job movement from one environment to the other

A deployment package can be created with any first class objects from the repository. New objects can be added to an existing package.

The description of this package is stored in the metadata repository.

Associated objects that live outside of DS and QS, such as files and scripts, can also be added.

Audit and Security Control


Role-Based Tools with Integrated Metadata

Simplify Integration

Increase trust and confidence in information

Increase compliance to standards

Facilitate change management & reuse

Design, Operational

Roles: Developers, Subject Matter Experts, Data Analysts, Business Users, Architects, DBAs

Unified Metadata Management


InfoSphere FastTrack

To reduce Costs of Integration Projects through Automation

Business analysts and IT collaborate in context to create project specification

Leverages source analysis, target models, and metadata to facilitate mapping process

Auto-generation of data transformation jobs & reports

Generate historical documentation for tracking

Supports data governance

Flexible Reporting

Specification

Auto-generates DataStage jobs


Agenda

Platform Overview

Dessert - Teradata


Extending DataStage - Defining Your Own Stage Types

Define your own stage type to be integrated into data flow

Stage Types

Wrapped: Specify an OS command or script (existing routines, logic, apps)

BuildOp: Wizard / macro-driven development

Custom: API development

Available to all jobs in the project

All metadata is captured


Building Wrapped Stages

In a nutshell:

You can wrap a legacy executable (binary, Unix command, shell script) and turn it into a bona fide DataStage stage capable, among other things, of parallel execution, as long as the legacy executable is

amenable to data-partition parallelism: no dependencies between rows

pipe-safe: can read rows sequentially, with no random access to data, e.g., use of fseek()
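As an illustration of what "pipe-safe" and "no dependencies between rows" mean in practice, here is a minimal sketch of a filter that could be a candidate for wrapping. The script, its function names, and the comma-separated row layout are assumptions made for this example; they do not come from the presentation or from DataStage itself.

```python
import sys


def transform(line: str) -> str:
    """Row-local logic: upper-case the second comma-separated field, if any."""
    fields = line.rstrip("\n").split(",")
    if len(fields) > 1:
        fields[1] = fields[1].upper()
    return ",".join(fields) + "\n"


def run(src, dst) -> None:
    """Copy rows from src to dst one at a time, never seeking and keeping no state between rows."""
    for line in src:
        dst.write(transform(line))


def main() -> None:
    """Entry point when the script is used as a stdin-to-stdout filter."""
    run(sys.stdin, sys.stdout)
```

Because each output row depends only on its own input row, several copies of such a filter can each be fed a different partition of the data. Calling main() at the bottom of the file would turn it into a standalone command.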


Building BuildOp Stages

In a nutshell:

The user performs the fun, glamorous tasks: encapsulate business logic in a custom operator

The DataStage wizard called buildop automatically performs the unglamorous, tedious, error-prone tasks: invoke needed header files, build the necessary plumbing for a correct and efficient parallel execution.


Major difference between BuildOp and Custom Stages

Layout interfaces describe what columns the stage:

needs for its inputs

creates for its outputs

Two kinds of interfaces: dynamic and static

Dynamic: adjusts to its inputs automatically

Static: expects input to contain columns with specific names and types

Custom Stages can be Dynamic or Static

IBM DataStage supplied Stages are dynamic

BuildOp stages are only Static


Appetizer - Productivity

First Course - Extensibility

Dessert - Teradata


Scalability is important everywhere

It takes me 4 1/2 hours to wash, dry and fold 3 loads of laundry (1/2 hour for each operation)

Sequential Approach (4 1/2 Hours)

Wash a load, dry the load, fold it

Wash-dry-fold

Wash-dry-fold

Pipeline Approach (2 1/2 Hours)

Wash a load, when it is done, put it in the dryer, etc

Load washing while another load is drying, etc

Partitioned Parallelism Approach (1 1/2 Hours)

Divide the laundry into different loads (whites, darks, linens)

Work on each piece independently

3 times faster with 3 washing machines, continue with 3 dryers and so on

[Figure: laundry is partitioned for washing (whites, darks, linens), repartitioned for drying (work clothes, delicates, lingerie, outside linens), and repartitioned again for folding (shirts on hangers, pants folded)]
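The arithmetic behind the three laundry approaches can be sketched as follows, assuming 3 loads, 3 operations (wash, dry, fold) and 1/2 hour per operation. The function names are made up for this example, and the partitioned case assumes one washer, dryer and folder per load.

```python
from fractions import Fraction

STAGES = 3              # wash, dry, fold
LOADS = 3
STEP = Fraction(1, 2)   # hours per operation


def sequential_hours(loads=LOADS, stages=STAGES, step=STEP):
    """Every operation on every load runs one after another."""
    return loads * stages * step


def pipelined_hours(loads=LOADS, stages=STAGES, step=STEP):
    """A new load enters the washer as soon as the previous one moves to the dryer."""
    return (stages + loads - 1) * step


def partitioned_hours(loads=LOADS, stages=STAGES, step=STEP, machines=LOADS):
    """Each set of machines handles its own loads independently of the others."""
    waves = -(-loads // machines)  # ceiling division: loads handled per machine set
    return waves * stages * step
```

With the defaults, sequential_hours() gives 9/2, pipelined_hours() gives 5/2, and partitioned_hours() gives 3/2, that is 4 1/2, 2 1/2 and 1 1/2 hours.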



Data Partitioning

Break up big data into partitions

Run one partition on each processor

4X faster on 4 processors; 100X faster on 100 processors

Partitioning is specified per stage, meaning partitioning can change between stages

Types of partitioning options available

[Figure: Source Data split by key range (A-F, G-M, N-T, U-Z) across Processors 1 to 4, each running Transform]
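A minimal sketch of the key-range partitioning idea, using the four ranges A-F, G-M, N-T and U-Z. The helper names and the toy transform are assumptions for illustration, and a thread pool stands in for the four processors; this is not how DataStage implements partitioning.

```python
from concurrent.futures import ThreadPoolExecutor

RANGES = [("A", "F"), ("G", "M"), ("N", "T"), ("U", "Z")]


def partition_of(last_name: str) -> int:
    """Return the index of the key range that the name's first letter falls into."""
    first = last_name[:1].upper()
    for index, (low, high) in enumerate(RANGES):
        if low <= first <= high:
            return index
    raise ValueError(f"no partition for {last_name!r}")


def partition(rows):
    """Split rows into one list per key range, preserving input order within each."""
    parts = [[] for _ in RANGES]
    for row in rows:
        parts[partition_of(row)].append(row)
    return parts


def transform(part):
    """Toy per-partition transform: upper-case every name."""
    return [name.upper() for name in part]


def run_partitioned(rows):
    """Run the same transform on every partition, one worker per partition."""
    with ThreadPoolExecutor(max_workers=len(RANGES)) as pool:
        return list(pool.map(transform, partition(rows)))
```

Each worker sees only its own partition, which is what allows the work to spread across processors.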



Data Pipelining

Eliminate the write to disk and the read from disk between processes

Start a downstream process while an upstream process is still running.

This eliminates intermediate staging to disk, which is critical for big data.

This also keeps the processors busy.

[Figure: Archived Data moves from Source to Load at the Target in chunks of 1,000 records (first through eighth chunk shown; records 9,001 to 100,000,000)]
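The "no intermediate staging" idea can be sketched with Python generators: each chunk is handed downstream as soon as it is produced, so the full data set is never written out between steps. The names and the chunk-doubling transform are assumptions for this example. Generators run in a single thread, so the sketch shows the streaming hand-off only, not the concurrent execution of stages.

```python
from itertools import islice


def read_chunks(records, chunk_size=1000):
    """Yield successive chunks of records without materializing the whole input."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def transform(chunks):
    """Transform each chunk as it arrives from upstream."""
    for chunk in chunks:
        yield [value * 2 for value in chunk]


def load(chunks, target):
    """Append each transformed chunk to the target and return the record count."""
    total = 0
    for chunk in chunks:
        target.extend(chunk)
        total += len(chunk)
    return total
```

A call such as load(transform(read_chunks(source)), target) starts loading the first chunk before the last chunk has been read.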



Parallel Dataflow

Parallel Processing achieved in a data flow

Still limiting

Partitioning remains constant throughout flow

Not realistic for any real jobs

For example, what if transformations are based on customer id and enrichment is a householding task (i.e., based on post code)

[Figure: Source Data is partitioned once and flows through Transform, Aggregate, and Load to the Target]



Parallel Data Flow with Auto Repartitioning

Record repartitioning occurs automatically

No need to repartition data as you add processors or change hardware architecture

Broad range of partitioning methods: Entire, hash, modulus, random, round robin, same, DB2 range

[Figure: pipelined flow from Source to Target. Source Data is partitioned by customer last name (A-F, G-M, N-T, U-Z) for Transform, then repartitioned by customer postcode for Aggregate and by credit card number for Load]
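To make repartitioning between stages concrete, here is a minimal sketch of three of the listed methods (hash, modulus, round robin) and of regrouping rows on a different key. The function names and row layout are assumptions for illustration; DataStage's own partitioners are configured per stage rather than written by hand like this.

```python
import zlib
from itertools import chain


def hash_partition(rows, key, n):
    """Rows with the same key value always land in the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[zlib.crc32(str(row[key]).encode()) % n].append(row)
    return parts


def modulus_partition(rows, key, n):
    """For integer keys: partition number is key modulo n."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[key] % n].append(row)
    return parts


def round_robin_partition(rows, n):
    """Deal rows out in turn, giving evenly sized partitions regardless of key."""
    parts = [[] for _ in range(n)]
    for index, row in enumerate(rows):
        parts[index % n].append(row)
    return parts


def repartition(parts, partitioner, *args):
    """Regroup already partitioned rows using a different method or key."""
    return partitioner(chain.from_iterable(parts), *args)
```

For example, rows first split with modulus_partition on a customer id can be regrouped with repartition(parts, hash_partition, "postcode", 3) before a step that needs all rows for one postcode together.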



Application Assembly: One Dataflow Graph Created With the DataStage GUI

Application Execution: Sequential or Parallel

[Figure: the same dataflow graph, from Source Data to the target Data Warehouse, shown running sequentially, 4-way parallel, and 128-way parallel on 32 4-way nodes. In the parallel runs the data is partitioned by customer last name (A-F, G-M, N-T, U-Z) and repartitioned by customer zip code and by credit card number]



Robust mechanisms for handling big data

Data Set stage: allows you to read data from or write data to a data set. Parallel Extender data sets hide the complexities of handling and storing large collections of records in parallel across the disks of a parallel computer.

File Set stage: allows you to read data from or write data to a file set. It only executes in parallel mode.

Sequential File stage: reads data from or writes data to one or more flat files. It usually executes in parallel, but can be configured to execute sequentially.

External Source stage: allows you to read data that is output from one or more source programs.

External Target stage: allows you to write data to one or more target programs.


DataStage SAS Stages

Parallel SAS Data Set stage:

Allows you to read data from or write data to a parallel SAS data set in conjunction with a SAS stage.

A parallel SAS Data Set is a set of one or more sequential SAS data sets, with a header file specifying the names and locations of all of the component files.

SAS stage:

Executes part or all of a SAS application in parallel by processing parallel streams of data with parallel instances of SAS DATA and PROC steps.



The Configuration File

{
    node "n1" {
        fastname "s1"
        pool "" "n1" "s1" "app2" "sort"
        resource disk "/orch/n1/d1" {}
        resource disk "/orch/n1/d2" {"bigdata"}
        resource scratchdisk "/temp" {"sort"}
    }
    node "n2" {
        fastname "s2"
        pool "" "n2" "s2" "app1"
        resource disk "/orch/n2/d1" {}
        resource disk "/orch/n2/d2" {"bigdata"}
        resource scratchdisk "/temp" {}
    }
    node "n3" {
        fastname "s3"
        pool "" "n3" "s3" "app1"
        resource disk "/orch/n3/d1" {}
        resource scratchdisk "/temp" {}
    }
    node "n4" {
        fastname "s4"
        pool "" "n4" "s4" "app1"
        resource disk "/orch/n4/d1" {}
        resource scratchdisk "/temp" {}
    }
}

Two key aspects:

1. Number of nodes declared

2. Defining a subset of resources (a "pool") so that execution can run under "constraints," i.e., using only that subset of resources
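The effect of pool constraints can be sketched by modelling the four nodes and their pool names as plain Python data and selecting the nodes that belong to a given pool. The dictionary layout and the nodes_in_pool helper are assumptions made for this illustration; they are not a parser for the configuration file format.

```python
# Pool membership copied from the four-node configuration file example.
NODES = {
    "n1": {"fastname": "s1", "pools": {"", "n1", "s1", "app2", "sort"}},
    "n2": {"fastname": "s2", "pools": {"", "n2", "s2", "app1"}},
    "n3": {"fastname": "s3", "pools": {"", "n3", "s3", "app1"}},
    "n4": {"fastname": "s4", "pools": {"", "n4", "s4", "app1"}},
}


def nodes_in_pool(nodes, pool=""):
    """Return the names of the nodes that belong to the given pool ("" is the default pool)."""
    return sorted(name for name, spec in nodes.items() if pool in spec["pools"])
```

Under this model, nodes_in_pool(NODES) returns all four nodes, nodes_in_pool(NODES, "app1") returns n2, n3 and n4, and nodes_in_pool(NODES, "sort") returns only n1, so work constrained to the "sort" pool would run on n1 alone.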



Dynamic GRID Capabilities

GRID Tab to help manage execution across a grid

Automatically reconfigures parallelism to fit GRID resources

Managed through an external grid resource manager

Available for Red Hat and Tivoli Work Scheduler Loadleveler

Locks resources at execution time to ensure SLAs



Job Monitoring & Logging

Job Performance Analysis

Detailed job monitoring information available during and after job execution

Start and elapsed times

Record counts per link

% CPU used by each process

Data skew across partitions

Also available from the command line: dsjob -report [type], where type = BASIC, DETAIL, XML
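One simple way to quantify data skew from per-partition record counts is the ratio of the largest partition to the average partition size. This definition and the function below are assumptions for illustration only; the presentation does not say how the monitor computes skew.

```python
def partition_skew(record_counts):
    """Ratio of the largest partition to the mean partition size (1.0 means perfectly even)."""
    total = sum(record_counts)
    if total == 0:
        return 0.0
    mean = total / len(record_counts)
    return max(record_counts) / mean
```

Four partitions of 250 records each give 1.0, while counts of 700, 100, 100 and 100 give 2.8, meaning the busiest partition carries almost three times its fair share.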



InfoSphere DataStage Balanced Optimization

Data Transformation

Provides automatic optimization of data flows, mapping transformation logic to SQL

Leverages investments in DBMS hardware by executing data integration tasks with and within the DBMS

Optimizes job run-time by allowing the developer to control where the job or various parts of the job will execute.

Transform and aggregate any volume of information in batch or real time through visually designed logic

Architects, Developers

Optimizing run time through intelligent use of DBMS hardware



Using Balanced Optimization

1. Design job in DataStage Designer (original DataStage job)

2. Compile & run

3. Verify job results

4. Optimize job with Balanced Optimization (rewritten optimized job)

5. Compile & run

6. Choose different options and reoptimize

7. Manually review/edit optimized job



Leveraging best-of-breed systems

Optimization is not constrained to a single implementation style such as ETL or ELT

InfoSphere DataStage Balanced Optimization fully harnesses available capacity and computing power in Teradata and DataStage

Delivering unlimited scalability and performance through parallel execution everywhere, all the time



Appetizer - Productivity

First Course - Extensibility

Dessert - Teradata



InfoSphere Rich Connectivity

Rich Connectivity

Shared, easy to use connectivity infrastructure

Event-driven, real-time, and batch connectivity

High volume, parallel connectivity to databases and file systems

Best-of-breed, metadata-driven connectivity to enterprise applications

Frictionless Connectivity



Teradata Connectivity

7+ highly optimized interfaces for Teradata leveraging Teradata Tools and Utilities: FastLoad, FastExport, TPUMP, MultiLoad, Teradata Parallel Transporter, CLI and ODBC

You use the best interfaces specifically designed for your integration requirements:

Parallel/bulk extracts/loads with various/optimized data partitioning options

Table maintenance

Real-time/transactional trickle feeds without table locking

[Figure: FastLoad, FastExport, MultiLoad, TPUMP, TPT, CLI, and ODBC interfaces connecting to the Teradata EDW]



Teradata Connectivity

RequestedSessions determines the total number of distributed connections to the Teradata source or target

When not specified, it equals the number of Teradata VPROCs (AMPs)

Can set between 1 and number of VPROCs

SessionsPerPlayer determines the number of connections each player will have to Teradata. Indirectly, it also determines the number of players (degree of parallelism).

Default is 2 sessions / player

The number selected should be such that SessionsPerPlayer * number of nodes * number of players per node = RequestedSessions

Setting the value of SessionsPerPlayer too low on a large system can result in so many players that the job fails due to insufficient resources. In that case, the value for -SessionsPerPlayer should be increased.



Teradata SessionsPerPlayer Example

Teradata Server: MPP with 4 TPA nodes, 4 AMPs per TPA node

DataStage Server: Example Settings

Configuration File    Sessions Per Player    Total Sessions
16 nodes              1                      16
8 nodes               1                      8
8 nodes               2                      16
4 nodes               4                      16
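The example settings follow from the relation SessionsPerPlayer * number of nodes * number of players per node = RequestedSessions. The sketch below reproduces the four rows, assuming one player per node (an assumption that the table's numbers imply but the presentation does not state) and a limit of 16 AMPs (4 TPA nodes with 4 AMPs each). The function and variable names are made up for this example.

```python
AMPS = 4 * 4  # 4 TPA nodes with 4 AMPs per node


def requested_sessions(nodes, sessions_per_player, players_per_node=1):
    """Total sessions implied by a configuration file size and a SessionsPerPlayer value."""
    return sessions_per_player * nodes * players_per_node


def players_needed(requested, sessions_per_player):
    """Number of players required to hold the requested sessions (ceiling division)."""
    return -(-requested // sessions_per_player)


# (configuration file nodes, sessions per player) for the four example rows
EXAMPLES = [(16, 1), (8, 1), (8, 2), (4, 4)]
TOTALS = [requested_sessions(nodes, spp) for nodes, spp in EXAMPLES]
```

TOTALS evaluates to 16, 8, 16 and 16, matching the Total Sessions column, and none of the values exceeds the 16 AMPs available. players_needed shows the trade-off behind the advice on large systems: for a fixed session count, halving SessionsPerPlayer doubles the number of players.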



Customer Example - Massive Throughput

CSA Production Linux Grid

[Figure: Teradata system of twelve Teradata nodes; Linux grid of Head Node and Compute A through Compute E; two stacked 24-port switches; 32 Gb interconnect]



The IBM InfoSphere Information Server Advantage

A Complete Information Infrastructure

A comprehensive, unified foundation for enterprise information architectures, scalable to any volume and processing requirement

Auditable data quality as a foundation for trusted information across the enterprise

Metadata-driven integration, providing breakthrough productivity and flexibility for integrating and enriching information

Consistent, reusable information services along with application services and process services, an enterprise essential

Accelerated time to value with proven, industry-aligned solutions and expertise

Broadest and deepest connectivity to information across diverse sources: structured, unstructured, mainframe, and applications


