Ascential DS 324PX - V70 - Student Reference


Transcript of Ascential DS 324PX - V70 - Student Reference

Page 1: Ascential DS 324PX - V70 - Student Reference

DS324PX DATASTAGE PX

STUDENT REFERENCE

Page 2: Ascential DS 324PX - V70 - Student Reference


Copyright © 2004 Ascential Software Corporation 1/1/04

Page 3: Ascential DS 324PX - V70 - Student Reference


Copyright This document and the software described herein are the property of Ascential Software Corporation and its licensors and contain confidential trade secrets. All rights to this publication are reserved. No part of this document may be reproduced, transmitted, transcribed, stored in a retrieval system or translated into any language, in any form or by any means, without prior permission from Ascential Software Corporation. Copyright © 2002 Ascential Software Corporation. All rights reserved. Ascential Software Corporation reserves the right to make changes to this document and the software described herein at any time and without notice. No warranty is expressed or implied other than any contained in the terms and conditions of sale.

Ascential Software Corporation 50 Washington Street

Westboro, MA 01581-1021 USA Phone: (508) 366-3888 Fax: (508) 389-8749

Ardent, Axielle, DataStage, Iterations, MetaBroker, MetaStage, and uniVerse are registered trademarks of Ascential Software Corporation. Pick is a registered trademark of Pick Systems. Ascential Software is not a licensee of Pick Systems. Other trademarks and registered trademarks are the property of the respective trademark holder.

01-01-2004

Page 4: Ascential DS 324PX - V70 - Student Reference



Page 5: Ascential DS 324PX - V70 - Student Reference


Table of Contents

Intro Part 1: Introduction....................................................1

Intro Part 2: Configuring Project ......................................13

Intro Part 3: Managing Meta Data ....................................25

Intro Part 4: Designing and Documenting Jobs ...............45

Intro Part 5: Running Jobs................................................75

Module 1: Review..............................................................85

Module 2: Sequential Access .........................................125

Module 3: Standards.......................................................157

Module 4: DBMS Access.................................................175

Module 5: PX Platform Architecture ..............................195

Module 6: Transforming Data .........................................229

Module 7: Sorting Data ...................................................249

Module 8: Combining Data..............................................261

Module 9: Configuration Files.........................................283

Module 10: Extending PX................................................305

Module 11: Meta Data.....................................................347

Module 12: Job Control...................................................369

Module 13: Testing .........................................................385

Page 6: Ascential DS 324PX - V70 - Student Reference



Page 7: Ascential DS 324PX - V70 - Student Reference


Intro Part 1

Introduction to DataStage PX

Page 8: Ascential DS 324PX - V70 - Student Reference


Ascential Platform

Page 9: Ascential DS 324PX - V70 - Student Reference


What is DataStage?

Design jobs for Extraction, Transformation, and Loading (ETL)

Ideal tool for data integration projects, such as data warehouses, data marts, and system migrations

Import, export, create, and manage metadata for use within jobs

Schedule, run, and monitor jobs all within DataStage

Administer your DataStage development and execution environments

DataStage is a comprehensive tool for the fast, easy creation and maintenance of data marts and data warehouses. It provides the tools you need to build, manage, and expand them. With DataStage, you can build solutions faster and give users access to the data and reports they need.

With DataStage you can:
• Design the jobs that extract, integrate, aggregate, load, and transform the data for your data warehouse or data mart.
• Create and reuse metadata and job components.
• Run, monitor, and schedule these jobs.
• Administer your development and execution environments.

Page 10: Ascential DS 324PX - V70 - Student Reference


DataStage Server and Clients

The DataStage client components are:
• Administrator – administers DataStage projects and conducts housekeeping on the server
• Designer – creates DataStage jobs that are compiled into executable programs
• Director – used to run and monitor the DataStage jobs
• Manager – allows you to view and edit the contents of the repository

Page 11: Ascential DS 324PX - V70 - Student Reference


DataStage Administrator

Use the Administrator to specify general server defaults, add and delete projects, and to set project properties. The Administrator also provides a command interface to the UniVerse repository.

Use the Administrator Project Properties window to:
• Set job monitoring limits and other Director defaults on the General tab.
• Set user group privileges on the Permissions tab.
• Enable or disable server-side tracing on the Tracing tab.
• Specify a user name and password for scheduling jobs on the Schedule tab.
• Specify hashed file stage read and write cache sizes on the Tunables tab.

Page 12: Ascential DS 324PX - V70 - Student Reference


Client Logon

Page 13: Ascential DS 324PX - V70 - Student Reference


DataStage Manager

Use the Manager to store and manage reusable metadata for the jobs you define in the Designer. This metadata includes table and file layouts and routines for transforming extracted data. Manager is also the primary interface to the DataStage repository. In addition to table and file layouts, it displays the routines, transforms, and jobs that are defined in the project. Custom routines and transforms can also be created in Manager.

Page 14: Ascential DS 324PX - V70 - Student Reference


DataStage Designer

The DataStage Designer allows you to use familiar graphical point-and-click techniques to develop processes for extracting, cleansing, transforming, integrating and loading data into warehouse tables. The Designer provides a “visual data flow” method to easily interconnect and configure reusable components.

Page 15: Ascential DS 324PX - V70 - Student Reference


DataStage Director

Use the Director to validate, run, schedule, and monitor your DataStage jobs. You can also gather statistics as the job runs.

Page 16: Ascential DS 324PX - V70 - Student Reference


Developing in DataStage

Define global and project properties in Administrator

Import meta data into Manager

Build job in Designer

Compile in Designer

Validate, run, and monitor in Director

• Define your project’s properties: Administrator
• Open (attach to) your project
• Import metadata that defines the format of data stores your jobs will read from or write to: Manager
• Design the job: Designer

− Define data extractions (reads)
− Define data flows
− Define data integration
− Define data transformations
− Define data constraints
− Define data loads (writes)
− Define data aggregations

• Compile and debug the job: Designer
• Run and monitor the job: Director

Page 17: Ascential DS 324PX - V70 - Student Reference


DataStage Projects

All your work is done in a DataStage project. Before you can do anything, other than some general administration, you must open (attach to) a project. Projects are created during and after the installation process. You can add projects after installation on the Projects tab of Administrator.

A project is associated with a directory. The project directory is used by DataStage to store your jobs and other DataStage objects and metadata.

You must open (attach to) a project before you can do any work in it. Projects are self-contained. Although multiple projects can be open at the same time, they are separate environments. You can, however, import and export objects between them.

Multiple users can be working in the same project at the same time. However, DataStage will prevent multiple users from accessing the same job at the same time.

Page 18: Ascential DS 324PX - V70 - Student Reference


Quiz– True or False

DataStage Designer is used to build and compile your ETL jobs

Manager is used to execute your jobs after you build them

Director is used to execute your jobs after you build them

Administrator is used to set global and project properties

Page 19: Ascential DS 324PX - V70 - Student Reference


Intro Part 2

Configuring Projects

Page 20: Ascential DS 324PX - V70 - Student Reference


Module Objectives

After this module you will be able to:
– Explain how to create and delete projects
– Set project properties in Administrator
– Set PX global properties in Administrator

Page 21: Ascential DS 324PX - V70 - Student Reference


Project Properties

Projects can be created and deleted in Administrator

Project properties and defaults are set in Administrator

Recall from module 1: In DataStage all development work is done within a project. Projects are created during installation and after installation using Administrator. Each project is associated with a directory. The directory stores the objects (jobs, metadata, custom routines, etc.) created in the project.

Before you can work in a project you must attach to it (open it). You can set the default properties of a project using DataStage Administrator.

Page 22: Ascential DS 324PX - V70 - Student Reference


Setting Project Properties

To set project properties, log onto Administrator, select your project, and then click “Properties”

The logon screen for Administrator does not provide the option to select a specific project (unlike the other DataStage clients).

Page 23: Ascential DS 324PX - V70 - Student Reference


Project Properties General Tab

The General Tab is used to change DataStage license information and to adjust the amount of time a connection will be maintained between the DataStage server and clients.

Page 24: Ascential DS 324PX - V70 - Student Reference


Projects General Tab

Click Properties on the DataStage Administration window to open the Project Properties window. There are five active tabs. (The Mainframe tab is only enabled if your license supports mainframe jobs.) The default is the General tab. If you select the Enable job administration in Director box, you can perform some administrative functions in Director without opening Administrator.

When a job is run in Director, events are logged describing the progress of the job. For example, events are logged when a job starts, when it stops, and when it aborts. The number of logged events can grow very large. The Auto-purge of job log box allows you to specify conditions for purging these events. You can limit the logged events either by number of days or number of job runs.
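The two purge conditions can be sketched in Python (illustrative only; the function and the entry layout are hypothetical, not part of DataStage):

```python
from datetime import datetime, timedelta

def purge_log(entries, up_to_days=None, keep_runs=None):
    """Keep recent job-log entries, mimicking the two auto-purge options:
    by age in days, or by number of previous job runs.
    Each entry is a (run_id, timestamp, message) tuple."""
    if up_to_days is not None:
        # Drop events older than the cutoff date
        cutoff = datetime.now() - timedelta(days=up_to_days)
        entries = [e for e in entries if e[1] >= cutoff]
    if keep_runs is not None:
        # Keep only events belonging to the most recent N runs
        recent = sorted({e[0] for e in entries})[-keep_runs:]
        entries = [e for e in entries if e[0] in recent]
    return entries
```

Either condition alone trims the log; applying both keeps only events that satisfy both limits.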

Page 25: Ascential DS 324PX - V70 - Student Reference


Environment Variables

Page 26: Ascential DS 324PX - V70 - Student Reference


Permissions Tab

Use this page to set user group permissions for accessing and using DataStage. All DataStage users must belong to a recognized user role before they can log on to DataStage. This helps to prevent unauthorized access to DataStage projects.

There are three roles of DataStage user:
• DataStage Developer, who has full access to all areas of a DataStage project.
• DataStage Operator, who can run and manage released DataStage jobs.
• <None>, who does not have permission to log on to DataStage.

UNIX note: In UNIX, the groups displayed are defined in /etc/group.

Page 27: Ascential DS 324PX - V70 - Student Reference


Tracing Tab

This tab is used to enable and disable server-side tracing. The default is for server-side tracing to be disabled. When you enable it, information about server activity is recorded for any clients that subsequently attach to the project. This information is written to trace files. Users with in-depth knowledge of the system software can use it to help identify the cause of a client problem. If tracing is enabled, users receive a warning message whenever they invoke a DataStage client.

Warning: Tracing causes a lot of server system overhead. It should only be used to diagnose serious problems.

Page 28: Ascential DS 324PX - V70 - Student Reference


Tunables Tab

On the Tunables tab, you can specify the sizes of the memory caches used when reading rows in hashed files and when writing rows to hashed files. Hashed files are mainly used for lookups and are discussed in a later module.

Page 29: Ascential DS 324PX - V70 - Student Reference


Parallel Tab

You should enable OSH for viewing – OSH is generated when you compile a job.

Page 30: Ascential DS 324PX - V70 - Student Reference


Page 31: Ascential DS 324PX - V70 - Student Reference


Intro Part 3

Managing Meta Data

Page 32: Ascential DS 324PX - V70 - Student Reference


Module Objectives

After this module you will be able to:
– Describe the DataStage Manager components and functionality
– Import and export DataStage objects
– Import metadata for a sequential file

Page 33: Ascential DS 324PX - V70 - Student Reference


What Is Metadata?

[Diagram: data flows from Source through Transform to Target; metadata describing each is stored in the Meta Data Repository]

Metadata is “data about data” that describes the formats of sources and targets. This includes general format information such as whether the record columns are delimited and, if so, the delimiting character. It also includes the specific column definitions.
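As an illustration (a minimal Python sketch; the dictionary layout and names are assumptions, not a DataStage format), a table definition carries just enough metadata for generic code to interpret raw records:

```python
# A table definition is metadata: it describes the format of the data,
# not the data itself. This hypothetical definition for a delimited
# file drives a generic record parser.
table_def = {
    "delimiter": ",",
    "columns": [("CustID", int), ("Name", str), ("Balance", float)],
}

def parse_record(line, tdef):
    """Split one raw line per the delimiter and apply each column's type."""
    values = line.rstrip("\n").split(tdef["delimiter"])
    return {name: typ(v) for (name, typ), v in zip(tdef["columns"], values)}
```

The same raw line would parse differently under a different table definition, which is the point: the definition, not the data, supplies the interpretation.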

Page 34: Ascential DS 324PX - V70 - Student Reference


DataStage Manager

DataStage Manager is a graphical tool for managing the contents of your DataStage project repository, which contains metadata and other DataStage components such as jobs and routines. The left pane contains the project tree. There are seven main branches, but you can create subfolders under each. Select a folder in the project tree to display its contents.

Page 35: Ascential DS 324PX - V70 - Student Reference


Manager Contents

Metadata describing sources and targets: Table definitions

DataStage objects: jobs, routines, table definitions, etc.

DataStage Manager manages two different types of objects:
• Metadata describing sources and targets:
− Called table definitions in Manager. These are not to be confused with relational tables. DataStage table definitions are used to describe the format and column definitions of any type of source: sequential, relational, hashed file, etc.
− Table definitions can be created in Manager or Designer and they can also be imported from the sources or targets they describe.
• DataStage components:
− Every object in DataStage (jobs, routines, table definitions, etc.) is stored in the DataStage repository. Manager is the interface to this repository.
− DataStage components, including whole projects, can be exported from and imported into Manager.

Page 36: Ascential DS 324PX - V70 - Student Reference


Import and Export

Any object in Manager can be exported to a file

Can export whole projects

Use for backup

Sometimes used for version control

Can be used to move DataStage objects from one project to another

Use to share DataStage jobs and projects with other developers

Any set of DataStage objects, including whole projects, which are stored in the Manager Repository, can be exported to a file. This export file can then be imported back into DataStage.

Import and export can be used for many purposes, including:
• Backing up jobs and projects.
• Maintaining different versions of a job or project.
• Moving DataStage objects from one project to another. Just export the objects, move to the other project, then re-import them into the new project.
• Sharing jobs and projects between developers. The export files, when zipped, are small and can be easily emailed from one developer to another.

Page 37: Ascential DS 324PX - V70 - Student Reference


Export Procedure

In Manager, click “Export>DataStage Components”

Select DataStage objects for export

Specify type of export: DSX, XML

Specify file path on client machine

Click Export>DataStage Components in Manager to begin the export process. Any object in Manager can be exported to a file. Use this procedure to back up your work or to move DataStage objects from one project to another.

Select the types of components to export. You can select either the whole project or a portion of the objects in the project. Specify the name and path of the file to export to. By default, objects are exported to a text file in a special format, and the extension is dsx. Alternatively, you can export the objects to an XML document. The directory you export to is on the DataStage client, not the server.

Page 38: Ascential DS 324PX - V70 - Student Reference


Quiz: True or False?

You can export DataStage objects such as jobs, but you can’t export metadata, such as field definitions of a sequential file.

True or False? You can export DataStage objects such as jobs, but you can't export metadata, such as field definitions of a sequential file.

True: Incorrect. Metadata describing files and relational tables is stored as "Table Definitions". Table definitions can be exported and imported like any other DataStage object.

False: Correct! Metadata describing files and relational tables is stored as "Table Definitions". Table definitions can be exported and imported like any other DataStage object.

Page 39: Ascential DS 324PX - V70 - Student Reference


Quiz: True or False?

The directory to which you export is on the DataStage client machine, not on the DataStage server machine.

True or False? The directory you export to is on the DataStage client machine, not on the DataStage server machine.

True: Correct! The directory you select for export must be addressable by your client machine.

False: Incorrect. The directory you select for export must be addressable by your client machine.

Page 40: Ascential DS 324PX - V70 - Student Reference


Exporting DataStage Objects

Page 41: Ascential DS 324PX - V70 - Student Reference


Exporting DataStage Objects

Page 42: Ascential DS 324PX - V70 - Student Reference


Import Procedure

In Manager, click “Import>DataStage Components”

Select DataStage objects for import

To import DataStage components, click Import>DataStage Components. Select the file to import. Click Import all to begin the import process or Import selected to view a list of the objects in the import file. You can import selected objects from the list. Select the Overwrite without query button to overwrite objects with the same name without warning.

Page 43: Ascential DS 324PX - V70 - Student Reference


Importing DataStage Objects

Page 44: Ascential DS 324PX - V70 - Student Reference


Import Options

Page 45: Ascential DS 324PX - V70 - Student Reference


Exercise

Import DataStage Component (table definition)

Page 46: Ascential DS 324PX - V70 - Student Reference


Metadata Import

Import format and column definitions from sequential files

Import relational table column definitions

Imported as “Table Definitions”

Table definitions can be loaded into job stages

Table definitions define the formats of a variety of data files and tables. These definitions can then be used and reused in your jobs to specify the formats of data stores. For example, you can import the format and column definitions of the Customers.txt file. You can then load this into the sequential source stage of a job that extracts data from the Customers.txt file. You can load this same metadata into other stages that access data with the same format. In this sense the metadata is reusable. It can be used with any file or data store with the same format. If the column definitions are similar to what you need, you can modify the definitions and save the table definition under a new name.

You can import and define several different kinds of table definitions, including sequential files and ODBC data sources.

Page 47: Ascential DS 324PX - V70 - Student Reference


Sequential File Import Procedure

In Manager, click Import>Table Definitions>Sequential File Definitions

Select directory containing sequential file and then the file

Select Manager category

Examine format and column definitions and edit if necessary

To start the import, click Import>Table Definitions>Sequential File Definitions. The Import Meta Data (Sequential) window is displayed. Select the directory containing the sequential files. The Files box is then populated with the files you can import. Select the file to import. Select or specify a category (folder) to import into.
• The format is: <Category>\<Sub-category>
• <Category> is the first-level sub-folder under Table Definitions.
• <Sub-category> is (or becomes) a sub-folder under the type.

Page 48: Ascential DS 324PX - V70 - Student Reference


Manager Table Definition

In Manager, select the category (folder) that contains the table definition. Double-click the table definition to open the Table Definition window. Click the Columns tab to view and modify any column definitions. Select the Format tab to edit the file format specification.

Page 49: Ascential DS 324PX - V70 - Student Reference


Importing Sequential Metadata

Page 50: Ascential DS 324PX - V70 - Student Reference


Page 51: Ascential DS 324PX - V70 - Student Reference


Intro Part 4

Designing and Documenting Jobs

Page 52: Ascential DS 324PX - V70 - Student Reference


Module Objectives

After this module you will be able to:
– Describe what a DataStage job is
– List the steps involved in creating a job
– Describe links and stages
– Identify the different types of stages
– Design a simple extraction and load job
– Compile your job
– Create parameters to make your job flexible
– Document your job

Page 53: Ascential DS 324PX - V70 - Student Reference


What Is a Job?

Executable DataStage program

Created in DataStage Designer, but can use components from Manager

Built using a graphical user interface

Compiles into Orchestrate shell language (OSH)

A job is an executable DataStage program. In DataStage, you can design and run jobs that perform many useful data integration tasks, including data extraction, data conversion, data aggregation, data loading, etc. DataStage jobs are:
• Designed and built in Designer.
• Scheduled, invoked, and monitored in Director.
• Executed under the control of DataStage.

Page 54: Ascential DS 324PX - V70 - Student Reference


Job Development Overview

In Manager, import metadata defining sources and targets

In Designer, add stages defining data extractions and loads

Add Transformers and other stages to define data transformations

Add links defining the flow of data from sources to targets

Compile the job

In Director, validate, run, and monitor your job

In this module, you will go through the whole process with a simple job, except for the first bullet. In this module you will manually define the metadata.

Page 55: Ascential DS 324PX - V70 - Student Reference


Designer Work Area

The appearance of the Designer work space is configurable; the graphic shown here is only one example of how you might arrange components. In the right center is the Designer canvas, where you create stages and links. On the left is the Repository window, which displays the branches in Manager. Items in Manager, such as jobs and table definitions, can be dragged to the canvas area. Click View>Repository to display the Repository window.

Page 56: Ascential DS 324PX - V70 - Student Reference


Designer Toolbar

Provides quick access to the main functions of Designer

Toolbar buttons include Job properties, Compile, and Show/hide metadata markers.

Page 57: Ascential DS 324PX - V70 - Student Reference


Tools Palette

The tool palette contains icons that represent the components you can add to your job design. You can also install additional stages called plug-ins for special purposes.

Page 58: Ascential DS 324PX - V70 - Student Reference


Adding Stages and Links

Stages can be dragged from the tools palette or from the stage type branch of the repository view

Links can be drawn from the tools palette or by right clicking and dragging from one stage to another

Page 59: Ascential DS 324PX - V70 - Student Reference


Sequential File Stage

Used to extract data from, or load data to, a sequential file

Specify full path to the file

Specify a file format: fixed width or delimited

Specify column definitions

Specify write action

The Sequential stage is used to extract data from a sequential file or to load data into a sequential file. The main things you need to specify when editing the sequential file stage are the following:
• Path and name of file
• File format
• Column definitions
• If the sequential stage is being used as a target, the write action: overwrite the existing file or append to it.
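The stage's read-then-write behaviour, including the overwrite-or-append write action, can be approximated in a few lines of Python (a sketch only; the function name and arguments are hypothetical, not DataStage code):

```python
import csv

def copy_sequential(src, tgt, write_action="overwrite"):
    """Read a delimited source file and write its rows to a target file,
    honouring an overwrite-or-append write action as a Sequential
    target stage does."""
    mode = "a" if write_action == "append" else "w"  # append vs overwrite
    with open(src, newline="") as fin, open(tgt, mode, newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            writer.writerow(row)
```

Running it twice with `write_action="append"` doubles the target's row count, while the default overwrites the file each run.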

Page 60: Ascential DS 324PX - V70 - Student Reference


Job Creation Example Sequence

Brief walkthrough of procedure

Presumes meta data already loaded in repository

Page 61: Ascential DS 324PX - V70 - Student Reference


Designer - Create New Job

There are several types of DataStage jobs:
• Server – not covered in this course. You can create server jobs, convert them to a container, and then use that container in a parallel job; however, this has negative performance implications.
• Shared container (parallel or server) – contains reusable components that can be used by other jobs.
• Mainframe – DataStage 390, which generates Cobol code.
• Parallel – this course concentrates on parallel jobs.
• Job Sequence – used to create jobs that control execution of other jobs.

Page 62: Ascential DS 324PX - V70 - Student Reference


Drag Stages and Links Using Palette

The tools palette may be shown as a floating dock or placed along a border. Alternatively, it may be hidden and the developer may choose to pull needed stages from the repository onto the design work area.

Page 63: Ascential DS 324PX - V70 - Student Reference


Assign Meta Data

Meta data may be dragged from the repository and dropped on a link.

Page 64: Ascential DS 324PX - V70 - Student Reference


Editing a Sequential Source Stage

Any required properties that are not completed will appear in red.

You are defining the format of the data flowing out of the stage, that is, to the output link. Define the output link listed in the Output name box. You are defining the file from which the job will read. If the file doesn’t exist, you will get an error at run time. On the Format tab, you specify a format for the source file. You will be able to view its data using the View data button. Think of a link as like a pipe: what flows in one end flows out the other end (at the Transformer stage).

Page 65: Ascential DS 324PX - V70 - Student Reference


Editing a Sequential Target

Defining a sequential target stage is similar to defining a sequential source stage. You are defining the format of the data flowing into the stage, that is, from the input links. Define each input link listed in the Input name box. You are defining the file the job will write to. If the file doesn’t exist, it will be created. Specify whether to overwrite or append the data in the Update action set of buttons. On the Format tab, you can specify a different format for the target file than you specified for the source file.

If the target file doesn’t exist, you will not (of course!) be able to view its data until after the job runs. If you click the View data button, DataStage will return a “Failed to open …” error. The column definitions you defined in the source stage for a given (output) link will appear already defined in the target stage for the corresponding (input) link. Think of a link as like a pipe: what flows in one end flows out the other end. The format going in is the same as the format going out.

Page 66: Ascential DS 324PX - V70 - Student Reference


Transformer Stage

Used to define constraints, derivations, and column mappings

A column mapping maps an input column to an output column

In this module we will just define column mappings (no derivations)

In the Transformer stage you can specify:
• Column mappings
• Derivations
• Constraints

A column mapping maps an input column to an output column. Values are passed directly from the input column to the output column. Derivations calculate the values to go into output columns based on values in zero or more input columns. Constraints specify the conditions under which incoming rows will be written to output links.
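The three concepts can be illustrated with a small Python sketch (the row layout and column names are hypothetical, and this is plain Python, not DataStage Transformer syntax):

```python
def transform(rows):
    """Per-row processing analogous to a Transformer stage: a constraint
    filters rows, a derivation computes a new column, and a mapping
    passes a column straight through."""
    out = []
    for row in rows:
        if row["qty"] <= 0:        # constraint: drop rows that fail the condition
            continue
        out.append({
            "item": row["item"],                # column mapping: value passed through
            "total": row["qty"] * row["price"]  # derivation: computed from inputs
        })
    return out
```

Rows that fail the constraint simply never reach the output link.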

Page 67: Ascential DS 324PX - V70 - Student Reference


Transformer Stage Elements

There are two transformer stages: Transformer and Basic Transformer. Both look the same but access different routines and functions.

Notice the following elements of the transformer:
• The top, left pane displays the columns of the input links.
• The top, right pane displays the contents of the stage variables.
• The lower, right pane displays the contents of the output link. Unresolved column mappings will show the output in red.
• The bottom area shows the column definitions (metadata) for the input and output links.

For now, ignore the Stage Variables window in the top, right pane. This will be discussed in a later module.

Page 68: Ascential DS 324PX - V70 - Student Reference


Create Column Mappings

Page 69: Ascential DS 324PX - V70 - Student Reference


Creating Stage Variables

Stage variables are used for a variety of purposes:
• Counters
• Temporary registers for derivations
• Controls for constraints
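The counter use can be sketched in plain Python (again a conceptual analogy, not DataStage expression syntax): the variable persists from row to row, just as a stage variable retains its value across rows and is evaluated before the output derivations.

```python
# Conceptual sketch: a "stage variable" used as a running row counter.

def number_rows(rows):
    sv_count = 0                 # plays the role of a stage variable
    out = []
    for row in rows:
        sv_count += 1            # stage-variable derivation runs first...
        out.append({"rownum": sv_count, **row})  # ...then output derivations use it
    return out

numbered = number_rows([{"a": 1}, {"a": 2}])
print(numbered)
```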

Page 70: Ascential DS 324PX - V70 - Student Reference


Result

Page 71: Ascential DS 324PX - V70 - Student Reference


Adding Job Parameters

Makes the job more flexible

Parameters can be:– Used in constraints and derivations– Used in directory and file names

Parameter values are determined at run time
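Job parameters are referenced by name inside stage properties (for example in file names) and resolved when the job runs. The sketch below illustrates the idea of run-time substitution using the #Name# marker convention; the resolver itself is an invented illustration, not the DataStage engine's implementation.

```python
# Sketch of run-time job-parameter resolution in a file-name property.
import re

def resolve(text, params):
    # Replace each #Name# marker with the run-time parameter value
    return re.sub(r"#(\w+)#", lambda m: params[m.group(1)], text)

path = resolve("/data/#Env#/customers_#RunDate#.txt",
               {"Env": "dev", "RunDate": "20040101"})
print(path)
```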

Page 72: Ascential DS 324PX - V70 - Student Reference


Adding Job Documentation

Job Properties– Short and long descriptions– Shows in Manager

Annotation stage– Is a stage on the tool palette– Shows on the job GUI (work area)

Page 73: Ascential DS 324PX - V70 - Student Reference


Job Properties Documentation

Page 74: Ascential DS 324PX - V70 - Student Reference


Annotation Stage on the Palette

Two versions of the annotation stage are available:
• Annotation
• Description Annotation

The difference will be evident on the following slides.

Page 75: Ascential DS 324PX - V70 - Student Reference


Annotation Stage Properties

You can type in whatever you want; the default text comes from the short description in the job's properties, if you entered one.

Add one or more Annotation stages to the canvas to document your job. An Annotation stage works like a text box with various formatting options. You can optionally show or hide the Annotation stages by pressing a button on the toolbar. There are two Annotation stages; the Description Annotation stage is discussed in a later slide.

Page 76: Ascential DS 324PX - V70 - Student Reference


Final Job Work Area with Documentation

Page 77: Ascential DS 324PX - V70 - Student Reference


Can be used to Document Specific Stages

To get the documentation to appear at the top of the text box, click the “top” button in the stage properties.

Page 78: Ascential DS 324PX - V70 - Student Reference


Compiling a Job

Before you can run your job, you must compile it. To compile it, click File > Compile or click the Compile button on the toolbar. The Compile Job window displays the status of the compile.

A compile will generate OSH.

Page 79: Ascential DS 324PX - V70 - Student Reference


Errors or Successful Message

If an error occurs:
• Click Show Error to identify the stage where the error occurred. This will highlight the stage in error.
• Click More to retrieve more information about the error. This can be lengthy for parallel jobs.

Page 80: Ascential DS 324PX - V70 - Student Reference


Page 81: Ascential DS 324PX - V70 - Student Reference


Intro Part 5

Running Jobs

Page 82: Ascential DS 324PX - V70 - Student Reference


Module Objectives

After this module you will be able to:
– Validate your job
– Use DataStage Director to run your job
– Set run options
– Monitor your job's progress
– View job log messages

Page 83: Ascential DS 324PX - V70 - Student Reference


Prerequisite to Job Execution

Result from Designer compile

Page 84: Ascential DS 324PX - V70 - Student Reference


DataStage Director

Can schedule, validate, and run jobs

Can be invoked from DataStage Manager or Designer– Tools > Run Director

As you know, you run your jobs in Director. You can open Director from within Designer by clicking Tools > Run Director. In a similar way, you can move between Director, Manager, and Designer.

There are two methods for running a job:
• Run it immediately.
• Schedule it to run at a later time or date.

To run a job immediately:
• Select the job in the Job Status view. The job must have been compiled.
• Click Job > Run Now or click the Run Now button in the toolbar. The Job Run Options window is displayed.

Page 85: Ascential DS 324PX - V70 - Student Reference


Running Your Job

This shows the Director Status view. To run a job, select it and then click Job>Run Now.

Better yet: shift to the log view from the main Director screen, then click the green arrow to execute the job.

Page 86: Ascential DS 324PX - V70 - Student Reference


Run Options – Parameters and Limits

The Job Run Options window is displayed when you click Job > Run Now. This window allows you to stop the job after:
• A certain number of rows.
• A certain number of warning messages.

You can validate your job before you run it. Validation performs some checks that are necessary for your job to run successfully. These include:
• Verifying that connections to data sources can be made.
• Verifying that files can be opened.
• Verifying that SQL statements used to select data can be prepared.

Click Run to run the job after it is validated. The Status column displays the status of the job run.

Page 87: Ascential DS 324PX - V70 - Student Reference


Director Log View

Click the Log button in the toolbar to view the job log. The job log records events that occur during the execution of a job. These events include control events, such as the starting, finishing, and aborting of a job; informational messages; warning messages; error messages; and program-generated messages.

Page 88: Ascential DS 324PX - V70 - Student Reference


Message Details are Available

Page 89: Ascential DS 324PX - V70 - Student Reference


Other Director Functions

Schedule job to run on a particular date/time

Clear job log

Set Director options– Row limits– Abort after x warnings

Page 90: Ascential DS 324PX - V70 - Student Reference


Blank

Page 91: Ascential DS 324PX - V70 - Student Reference


Module 1

DS324PX – DataStage PX

Review

Page 92: Ascential DS 324PX - V70 - Student Reference


Ascential’s Enterprise Data Integration Platform

[Platform diagram: any source (CRM, ERP, SCM, RDBMS, legacy, real-time, client-server, web services) to any target (data warehouse, BI/analytics, other applications), underpinned by meta data management, parallel execution, and command & control.]

DISCOVER – Data Profiling: gather relevant information for target enterprise applications.

PREPARE – Data Quality: cleanse, correct, and match input data.

TRANSFORM – Extract, Transform, Load: standardize and enrich data and load to targets.

Ascential Software looked at what is required across the whole lifecycle to solve enterprise data integration problems, not just once for an individual project but for multiple projects connecting any type of source (ERP, SCM, CRM, legacy systems, flat files, external data, etc.) with any type of target. Ascential's uniqueness comes from combining proven best-in-class functionality across data profiling, data quality and matching, and ETL into a complete end-to-end data integration solution on a platform with unlimited scalability and performance through parallelization. We can therefore deal not only with gigabytes of data but with terabytes and petabytes, data volumes that are becoming more and more common, and do so with complete management and integration of all the meta data in the enterprise environment. This is an end-to-end solution that customers can implement in whatever phases they choose, and which, by virtue of its completeness, breadth, and robustness, helps customers get the quickest possible time to value and time to impact from strategic applications. It assures that good data is being fed into the informational systems, and that the solution provides the ROI and economic benefit customers expect from these very large investments, which, done right, should command very large returns.

Page 93: Ascential DS 324PX - V70 - Student Reference


Course Objectives

You will learn to:
– Build DataStage PX jobs using complex logic
– Utilize parallel processing techniques to increase job performance
– Build custom stages based on application needs

Course emphasis is:
– Advanced usage of DataStage PX
– Application job development
– Best practices techniques

Students will gain hands-on experience with DSPX by building an application in class exercises. This application will use complex logic structures.

Page 94: Ascential DS 324PX - V70 - Student Reference


Course Agenda

Day 1– Review of PX Concepts– Sequential Access– Standards– DBMS Access

Day 2– PX Architecture– Transforming Data– Sorting Data

Day 3– Combining Data– Configuration Files

Day 4– Extending PX– Meta Data Usage– Job Control– Testing

Suggested agenda – actual classes will vary.

Page 95: Ascential DS 324PX - V70 - Student Reference


Module Objectives

Provide a background for completing work in the DSPX advanced course

Ensure all students will have a successful advanced class

Tasks– Review concepts covered in DSPX Essentials course

In this module we will review many of the concepts and ideas presented in the DataStage essentials course. At the end of this module students will be asked to complete a brief exercise demonstrating their mastery of that basic material.

Page 96: Ascential DS 324PX - V70 - Student Reference


Review Topics

DataStage architecture

DataStage client review– Administrator– Manager– Designer– Director

Parallel processing paradigm

DataStage Parallel Extender (DSPX)

Our topics for review will focus on overall DataStage architecture and a review of the parallel processing paradigm in DataStage parallel extender.

Page 97: Ascential DS 324PX - V70 - Student Reference


[Architecture diagram: the Administrator, Manager, Designer, and Director clients run on Microsoft® Windows NT/2000/XP and connect to the server engine and repository, which runs on UNIX or Microsoft® Windows NT. The engine provides parallel execution, meta data management, and command & control, moving data from any source (CRM, ERP, SCM, RDBMS, real-time, client-server, web services) to any target (data warehouse, BI/analytics, other applications) through the discover, prepare, transform, and extend phases.]

Client-Server Architecture

DataStage parallel extender consists of four clients connected to the DataStage parallel extender engine. The client applications are: Manager, Administrator, Designer, and Director. These clients connect to a server engine that is located on either a UNIX (parallel extender only) or NT operating system.

Page 98: Ascential DS 324PX - V70 - Student Reference


Process Flow

Administrator – add/delete projects, set defaults

Manager – import meta data, backup projects

Designer – assemble jobs, compile, and execute

Director – execute jobs, examine job run logs

A typical DataStage workflow consists of:
• Setting up the project in Administrator
• Importing meta data via Manager
• Building and assembling the job in Designer
• Executing and testing the job in Director

Page 99: Ascential DS 324PX - V70 - Student Reference


Administrator – Licensing and Timeout

Change licensing, if appropriate. The timeout period should be set to a large number, or choose the "do not timeout" option.

Page 100: Ascential DS 324PX - V70 - Student Reference


Administrator – Project Creation/Removal

Functions specific to a project.

Available functions:
• Add or delete projects.
• Set project defaults (Properties button).
• Cleanup – perform repository functions.
• Command – perform queries against the repository.

Page 101: Ascential DS 324PX - V70 - Student Reference


Administrator – Project Properties

RCP for parallel jobs should be enabled.

Variables for parallel processing.

Recommendations:
• Check enable job administration in Director.
• Check enable runtime column propagation.
• Optionally, check auto purge of jobs to manage messages in the Director log.

Page 102: Ascential DS 324PX - V70 - Student Reference


Administrator – Environment Variables

Variables are category specific

You will see different environment variables depending on which category is selected.

Page 103: Ascential DS 324PX - V70 - Student Reference


OSH is what is run by the PX Framework

Reading OSH will be covered in a later module. Since DataStage parallel extender writes OSH, you will want to check this option.

Page 104: Ascential DS 324PX - V70 - Student Reference


DataStage Manager

To attach to the DataStage Manager client, one first enters through the logon screen. Logons can be by either DNS name or IP address. Once logged onto Manager, users can import meta data, export all or portions of the project, or import components from another project's export.

Functions:
• Backup project
• Export
• Import
• Import meta data (table definitions, sequential file definitions; can be imported from MetaBrokers)
• Register/create new stages

Page 105: Ascential DS 324PX - V70 - Student Reference


Export Objects to MetaStage

Push meta data to MetaStage.

DataStage objects can now be pushed from DataStage to MetaStage.

Page 106: Ascential DS 324PX - V70 - Student Reference


Designer Workspace

Can execute the job from Designer.

Job design process:
• Determine data flow
• Import supporting meta data
• Use the Designer workspace to create a visual representation of the job
• Define properties for all stages
• Compile
• Execute

Page 107: Ascential DS 324PX - V70 - Student Reference


DataStage Generated OSH

The PX Framework runs OSH

The DataStage GUI now generates OSH when a job is compiled. This OSH is then executed by the parallel extender engine.
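OSH is a textual data-flow language: operators connected by pipes, reading and writing datasets. The snippet below is only a hand-written illustration of the general shape of such a script (operator names follow Orchestrate conventions; the OSH that DataStage actually generates for a compiled job is longer and machine-formatted, and the file name and schema here are invented):

```
# Illustrative shape only, not actual generated output
osh "import -file customers.txt
        -schema record(name:string; amount:int32)
     | peek -all"
```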

Page 108: Ascential DS 324PX - V70 - Student Reference


Director – Executing Jobs

Messages from previous run in different color

Messages from previous runs are kept in a different color from those of the current run.

Page 109: Ascential DS 324PX - V70 - Student Reference


Stages

Can now customize the Designer’s palette

In Designer: View > Customize palette. This window allows you to move icons into your Favorites folder, plus many other customization features.

Page 110: Ascential DS 324PX - V70 - Student Reference


Popular Stages

Row generator

Peek

The row generator and peek stages are especially useful during development.

Page 111: Ascential DS 324PX - V70 - Student Reference


Row Generator

Can build test data

Repeatable property

Edit row in column tab

Depending on the type of data, you can set values for each column in the row generator.

Page 112: Ascential DS 324PX - V70 - Student Reference


Peek

Displays field values– Will be displayed in job log or sent to a file– Skip records option– Can control number of records to be displayed

Can be used as stub stage for iterative development (more later)

The peek stage will display column values in a job's output messages log.

Page 113: Ascential DS 324PX - V70 - Student Reference


Why PX is so Effective

Parallel processing paradigm
– More hardware, faster processing
– Level of parallelization is determined by a configuration file read at runtime

Emphasis on memory
– Data read into memory and lookups performed like a hash table

PX takes advantage of the machine's hardware architecture -- this can be changed at runtime.

Page 114: Ascential DS 324PX - V70 - Student Reference


Parallel processing = executing your application on multiple CPUs
– Scalable processing = add more resources (CPUs, RAM, and disks) to increase system performance

• Example system containing 6 CPUs (or processing nodes) and disks

Scalable Systems

DataStage parallel extender can take advantage of multiple processing nodes to instantiate multiple instances of a DataStage job.

Page 115: Ascential DS 324PX - V70 - Student Reference


Three main types of scalable systems

Symmetric Multiprocessors (SMP), shared memory
– Sun Starfire™
– IBM S80
– HP Superdome

Clusters: UNIX systems connected via networks
– Sun Cluster
– Compaq TruCluster

MPP (see note)

Scalable Systems: Examples

You can describe an MPP as a bunch of connected SMPs.

Page 116: Ascential DS 324PX - V70 - Student Reference


• Multiple CPUs with a single operating system
• Programs communicate using shared memory
• All CPUs share system resources (OS, memory with single linear address space, disks, I/O)

When used with Parallel Extender:
• Data transport uses shared memory
• Simplified startup

Parallel Extender treats NUMA (Non-Uniform Memory Access) as plain SMP.

SMP: Shared Everything

A typical SMP machine has multiple CPUs that share both disks and memory.

Page 117: Ascential DS 324PX - V70 - Student Reference


[Diagram: operational and archived source data flow through Transform, Clean, and Load stages into the target data warehouse, with the data landing to disk between each operation.]

Traditional approach to batch processing:
• Write to disk and read from disk before each processing operation
• Sub-optimal utilization of resources
  – a 10 GB stream leads to 70 GB of I/O
  – processing resources can sit idle during I/O
• Very complex to manage (lots and lots of small jobs)
• Becomes impractical with big data volumes
  – disk I/O consumes the processing
  – terabytes of disk required for temporary staging

Traditional Batch Processing

The traditional data processing paradigm involves dropping data to disk many times throughout a processing run.

Page 118: Ascential DS 324PX - V70 - Student Reference


Data Pipelining
• Transform, clean, and load processes execute simultaneously on the same processor
• Rows are moving forward through the flow

[Diagram: operational and archived source data flow through Transform, Clean, and Load into the target data warehouse without landing to disk.]

• Start a downstream process while an upstream process is still running.
• This eliminates intermediate storing to disk, which is critical for big data.
• This also keeps the processors busy.
• Still has limits on scalability.

Think of a conveyor belt moving the rows from process to process!

Pipeline Multiprocessing

In contrast, the parallel processing paradigm rarely drops data to disk unless necessary for business reasons, such as backup and recovery.
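The conveyor-belt idea can be sketched with Python generators. This is an analogy only, not DataStage code: each stage pulls rows from the stage upstream, so rows flow through one at a time with no intermediate files.

```python
# Analogy: generators as a pipeline -- downstream stages start consuming
# rows while upstream stages are still producing them.

def source():
    for i in range(5):          # pretend these are rows read from a file
        yield i

def transform(rows):
    for r in rows:
        yield r * 10            # a per-row transformation

def clean(rows):
    for r in rows:
        if r != 20:             # drop one "dirty" row
            yield r

loaded = list(clean(transform(source())))
print(loaded)
```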

Page 119: Ascential DS 324PX - V70 - Student Reference


Data Partitioning

[Diagram: source data is split into four partitions (A-F, G-M, N-T, U-Z), and an identical Transform runs on each of nodes 1 through 4.]

• Break up big data into partitions
• Run one partition on each processor
• 4X faster on 4 processors; with data big enough, 100X faster on 100 processors
• This is exactly how the parallel databases work!
• Data partitioning requires applying the same transform to all partitions: Aaron Abbott and Zygmund Zorn undergo the same transform

Partition Parallelism

Data may actually be partitioned in several ways -- range partitioning is only one example. We will explore others later.
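Range partitioning, the example on this slide, can be sketched in plain Python (a conceptual illustration only; the range labels are taken from the slide, and upper-casing stands in for whatever transform the job applies identically to every partition):

```python
# Sketch of range partitioning: split rows by the first letter of the key,
# then apply the SAME transform to every partition.

RANGES = {"A-F": "ABCDEF", "G-M": "GHIJKLM", "N-T": "NOPQRST", "U-Z": "UVWXYZ"}

def partition(names):
    parts = {label: [] for label in RANGES}
    for name in names:
        for label, letters in RANGES.items():
            if name[0].upper() in letters:
                parts[label].append(name)
    return parts

def transform(partitioned):
    # identical transform applied to every partition (upper-casing here)
    return {label: [n.upper() for n in names]
            for label, names in partitioned.items()}

parts = transform(partition(["Aaron Abbott", "Zygmund Zorn", "Helen Hunt"]))
print(parts)
```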

Page 120: Ascential DS 324PX - V70 - Student Reference


• Dataset: uniform set of rows in the Framework's internal representation. Three flavors:
  1. file sets (*.fs): stored on multiple UNIX files as flat files
  2. persistent (*.ds): stored on multiple UNIX files in Framework format; read and written using the DataSet stage
  3. virtual (*.v): links, in Framework format, NOT stored on disk
  – The Framework processes only datasets, hence the possible need for Import.
  – Different datasets typically have different schemas.
  – Convention: "dataset" = Framework data set.

• Partition: subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file).
  – All the partitions of a dataset follow the same schema: that of the dataset.

PX Program Elements

Parallel extender deals with several different types of data: file sets and data sets, the latter in both persistent and virtual (non-persistent) forms.

Page 121: Ascential DS 324PX - V70 - Student Reference


Putting It All Together: Parallel Dataflow

[Diagram: source data flows through Transform, Clean, and Load into the data warehouse; pipelining moves rows forward between the stages while partitioning splits each stage across nodes.]

Combining Parallelism Types

Pipelining and partitioning can be combined together to provide a powerful parallel processing paradigm.

Page 122: Ascential DS 324PX - V70 - Student Reference


Putting It All Together: Parallel Dataflow with Repartitioning on-the-fly

Without Landing To Disk!

[Diagram: source data is partitioned (A-F, G-M, N-T, U-Z) into the Transform stage, repartitioned by a different key into Clean, and repartitioned again into Load, keyed successively on customer last name, customer zip code, and credit card number, all the way into the data warehouse.]

Repartitioning

In addition, data can change partitioning from stage to stage. This can either happen explicitly at the request of the programmer or be performed implicitly by the engine.
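Repartitioning can be sketched in plain Python (a conceptual illustration, not DataStage code; the column names are taken from the slide's example keys): the same rows, grouped one way for one stage, are regrouped by a different key for the next stage, with no disk landing in between.

```python
# Sketch of on-the-fly repartitioning: hash-partition by last name for
# stage 1, then re-hash the streamed rows by zip code for stage 2.

rows = [{"last": "Abbott", "zip": "01581"},
        {"last": "Zorn",   "zip": "01581"},
        {"last": "Abbott", "zip": "90210"}]

def partition_by(rows, key, n=2):
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)   # hash partitioner
    return parts

by_name = partition_by(rows, "last")            # partitioning for stage 1
streamed = [r for part in by_name for r in part]  # rows flow onward...
by_zip = partition_by(streamed, "zip")          # ...repartitioned for stage 2
print([len(p) for p in by_zip])
```

Rows sharing a key always land in the same partition, which is what lets the next stage group or join on that key.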

Page 123: Ascential DS 324PX - V70 - Student Reference


• Build scalable applications from simple components
• Run native operators developed in C++
• Run existing code written in C, COBOL, scripts, etc.
• Run existing sequential programs in parallel
• Parallel Extender will manage parallel execution

[Diagram: the Parallel Extender client submits shell scripts, C++ operators, and legacy or third-party programs (COBOL, C/C++, etc.) to the Framework, which runs them across scalable UNIX servers.]

DSPX Overview

Parallel extender can interface with UNIX commands, shell scripts, and legacy programs to run on the parallel extender framework. Underlying this architecture is the UNIX operating system.

Page 124: Ascential DS 324PX - V70 - Student Reference


[Diagram: an Orchestrate program is expressed as a sequential data flow (Import, Clean 1, Clean 2, Merge, Analyze). The Orchestrate application framework and runtime system, driven by a configuration file, runs the same flow in parallel, providing centralized error handling and event logging, parallel access to data in flat files and RDBMSs, inter-node communications, parallel pipelining, parallelization of operations, and performance visualization.]

Orchestrate Framework: provides application scalability.

DataStage Parallel Extender: best-of-breed scalable data integration platform with no limitations on data volumes or throughput.

DataStage PX Architecture

DataStage: provides the data integration platform.

The parallel extender engine was derived from DataStage and Orchestrate.

Page 125: Ascential DS 324PX - V70 - Student Reference


Orchestrate                  DataStage
schema                       table definition
property                     format
type                         SQL type + length [and scale]
virtual dataset              link
record/field                 row/column
operator                     stage
step, flow, OSH command      job
Framework                    DS engine

• GUI uses both terminologies
• Program messages (warnings, errors) use Orchestrate terms

Note: "import" and "constraint" are used differently in the two terminologies.

Glossary

Orchestrate terms and DataStage terms have an equivalency. The GUI and log messages frequently use terms from both paradigms.

Page 126: Ascential DS 324PX - V70 - Student Reference


DSPX:
– Automatically scales to fit the machine
– Handles data flow among multiple CPUs and disks

With DSPX you can:
– Create applications for SMPs, clusters, and MPPs

Parallel Extender is architecture-neutral:
– Access relational databases in parallel
– Execute external applications in parallel
– Store data across multiple disks and nodes

Introduction to DataStage PX

Parallel extender is architecturally neutral -- it can run on SMPs, clusters, and MPPs. The configuration file determines how parallel extender will treat the hardware.

Page 127: Ascential DS 324PX - V70 - Student Reference


User assembles the flow using the Designer (see below) and gets: parallel access, propagation, transformation, and load.

The design is good for 1 node, 4 nodes, or N nodes. To change the number of nodes, just swap the configuration file. No need to modify or recompile your design!

Job Design vs. Execution

Much of the parallel processing paradigm is hidden from the programmer; they simply designate the process flow, as shown in the upper portion of this diagram. Parallel extender, using the definitions in the configuration file, will actually execute UNIX processes that are partitioned and parallelized.

Page 128: Ascential DS 324PX - V70 - Student Reference


Partitioners distribute rows into partitions– implement data-partition parallelism

Collectors = inverse partitioners

Live on input links of stages running
– in parallel (partitioners)
– sequentially (collectors)

Use a choice of methods

DS-PX Program ElementsPartitioners and Collectors

Partitioners and collectors work in opposite directions; however, they frequently appear together in job designs.

Page 129: Ascential DS 324PX - V70 - Student Reference


Example Partitioning Icons

partitioner

Partitioners and collectors have no stages or icons of their own. They live on input links of stages running in parallel (respectively, sequentially). Link markings indicate their presence:

S -----------------> S   (no marking)
S ----(fan out)----> P   (partitioner)
P ----(fan in)-----> S   (collector)
P ----(box)--------> P   (no reshuffling: partitioner using the "SAME" method)
P ----(bow tie)----> P   (reshuffling: partitioner using another method)

Collectors = inverse partitioners: they recollect rows from partitions into a single input stream to a sequential stage.

They are responsible for some surprising behavior: The default (Auto) is "eager" to output rows and typically causes non-determinism: row order may vary from run to run with identical input.

Page 130: Ascential DS 324PX - V70 - Student Reference


Exercise

Complete exercises 1-1, 1-2, and 1-3.

Page 131: Ascential DS 324PX - V70 - Student Reference


Module 2

DSPX Sequential Access

Page 132: Ascential DS 324PX - V70 - Student Reference


Module Objectives

You will learn to:
– Import sequential files into the PX Framework
– Utilize parallel processing techniques to increase sequential file access
– Understand usage of the Sequential, DataSet, FileSet, and LookupFileSet stages
– Manage partitioned data stored by the Framework

In this module students will concentrate their efforts on sequential file access in parallel extender jobs.

Page 133: Ascential DS 324PX - V70 - Student Reference


Types of Sequential Data Stages

Sequential– Fixed or variable length

File Set

Lookup File Set

Data Set

Several stages handle sequential data. Each stage has both advantages over and differences from the other stages that handle sequential data. Sequential data can come in a variety of types, including both fixed length and variable length.

Page 134: Ascential DS 324PX - V70 - Student Reference


The PX Framework processes only datasets

For files other than datasets, such as flat files, Parallel Extender must perform import and export operations – this is performed by import and export OSH operators (generated by Sequential or FileSet stages)

During import or export DataStage performs format translations – into, or out of, the PX internal format

Data is described to the Framework in a schema

Sequential Stage Introduction

The DataStage sequential stage writes OSH, specifically the import and export Orchestrate operators.

Q: Why import data into an Orchestrate data set?
A: Partitioning works only with data sets. You must use data sets to distribute data to the multiple processing nodes of a parallel system.

Every Orchestrate program has to perform some type of import operation, from a flat file, COBOL data, an RDBMS, or a SAS data set. This section describes how to get your data into Orchestrate. We will also talk about getting your data back out: some people will be happy to leave data in Orchestrate data sets, while others require their results in a different format.

Page 135: Ascential DS 324PX - V70 - Student Reference


How the Sequential Stage Works

Generates Import/Export operators

Types of transport
– Performs direct C++ file I/O streams
– Source programs which feed stdout (e.g., gunzip) send stdout into PX via a sequential pipe

Behind each parallel stage is one or more Orchestrate operators. Import and Export are both operators that deal with sequential data.

Page 136: Ascential DS 324PX - V70 - Student Reference


Using the Sequential File Stage

Importing/Exporting Data

Both import and export of general files (text, binary) are performed by the SequentialFile Stage.

[Diagram: data import converts external data into the PX internal format; data export converts the PX internal format back out.]

When data is imported, the import operator translates the data into the parallel extender internal format. The export operator performs the reverse action.

Page 137: Ascential DS 324PX - V70 - Student Reference


Working With Flat Files

Sequential File Stage
– Normally will execute in sequential mode
– Can execute in parallel if reading multiple files (file pattern option)
– Can use multiple readers within a node on a fixed-width file
– DSPX needs to know:
  • How the file is divided into rows
  • How each row is divided into columns

Both export and import operators are generated by the sequential stage -- which one you get depends on whether the sequential stage is used as source or target.

Page 138: Ascential DS 324PX - V70 - Student Reference


Processes Needed to Import Data

Recordization
– Divides the input stream into records
– Set on the Format tab

Columnization
– Divides the record into columns
– Default set on the Format tab but can be overridden on the Columns tab
– Can be "incomplete" if using a schema, or not even specified in the stage if using RCP

These two processes must work together to correctly interpret data -- that is, to break a data string down into records and columns.

Page 139: Ascential DS 324PX - V70 - Student Reference


File Format Example

Diagram: a record of delimited fields -- Field 1 through the last field, separated by a field delimiter (comma); the final delimiter is either a comma or “end”; each record ends with a record delimiter (nl).

Fields (columns) are defined by delimiters. Similarly, records are defined by terminating characters.
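The two processes can be sketched in a few lines of Python (a hypothetical illustration of the concept, not DataStage code): recordization splits the stream on the record delimiter, and columnization splits each record on the field delimiter.

```python
# Hypothetical sketch of recordization and columnization.
# Record delimiter: newline; field delimiter: comma.

def import_stream(stream: str, record_delim: str = "\n", field_delim: str = ","):
    """Split a raw character stream into records, then each record into columns."""
    records = [r for r in stream.split(record_delim) if r]   # recordization
    return [r.split(field_delim) for r in records]           # columnization

rows = import_stream("1,Smith,NY\n2,Jones,MA\n")
# rows == [['1', 'Smith', 'NY'], ['2', 'Jones', 'MA']]
```

The real Format tab offers far more (quoting, escapes, fixed widths, binary types); the sketch only shows why both settings are needed before any data can be interpreted.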

Page 140: Ascential DS 324PX - V70 - Student Reference


Sequential File Stage

To set the properties, use the stage editor:
– Pages (General, Input/Output)
– Tabs (Format, Columns)

Sequential stage link rules:
– One input link
– One output link (except for the reject link definition)
– One reject link

The stage will reject any records not matching the metadata in the column definitions.

The DataStage GUI allows you to determine properties that will be used to read and write sequential files.

Page 141: Ascential DS 324PX - V70 - Student Reference


Job Design Using Sequential Stages

Stage categories

Source stage: multiple output links. Note, however, that one of the links is represented by a broken line. This is a reject link, not to be confused with a stream link or a reference link.

Target stage: one input link.

Page 142: Ascential DS 324PX - V70 - Student Reference


General Tab – Sequential Source

Multiple output links Show records

If multiple links are present, you'll need to select each link from the drop-down to see its records.

Page 143: Ascential DS 324PX - V70 - Student Reference


Properties – Multiple Files

Click to add more files having the same meta data.

If files are specified individually, you can make a list of files that are unrelated in name. If you set the “read method” to file pattern, you effectively select an undetermined number of files.

Page 144: Ascential DS 324PX - V70 - Student Reference


Properties - Multiple Readers

Multiple readers option allows you to set number of readers

To use multiple readers on a sequential file, the file must contain fixed-length records.
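Why fixed-length? Each reader must be able to seek straight to its share of the file without scanning for delimiters. A hypothetical sketch (illustration only, not the DataStage implementation) of how reader byte offsets could be computed:

```python
# Hypothetical sketch: divide a fixed-length-record file among N readers.
# With fixed-length records, each reader's start offset is computable
# directly, with no need to scan for record delimiters.

def reader_ranges(record_len: int, num_records: int, num_readers: int):
    """Return (start_byte, record_count) for each reader."""
    per = num_records // num_readers
    extra = num_records % num_readers        # first readers take one extra record
    ranges, start = [], 0
    for i in range(num_readers):
        count = per + (1 if i < extra else 0)
        ranges.append((start * record_len, count))
        start += count
    return ranges

print(reader_ranges(record_len=100, num_records=10, num_readers=4))
# [(0, 3), (300, 3), (600, 2), (800, 2)]
```

With delimited (variable-length) records no such direct computation is possible, which is why the multiple-readers option is restricted to fixed-width files.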

Page 145: Ascential DS 324PX - V70 - Student Reference


Format Tab

File into records. Record into columns.

DSPX needs to know:
– How a file is divided into rows
– How a row is divided into columns

Column properties set on this tab are defaults for each column; they can be overridden at the column level (from columns tab).

Page 146: Ascential DS 324PX - V70 - Student Reference


Read Methods

Page 147: Ascential DS 324PX - V70 - Student Reference


Reject Link

Reject mode = output

Source:
– All records not matching the metadata (the column definitions)

Target:
– All records that are rejected for any reason

Metadata: one column, data type = raw

The sequential stage can have a single reject link. This is typically used when you are writing to a file and provides a location where records that have failed to be written to a file for some reason can be sent. When you are reading files, you can use a reject link as a destination for rows that do not match the expected column definitions.
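The source-side behavior can be sketched as follows (hypothetical Python, not DataStage code): rows whose fields cannot be parsed against the column definitions go down the reject link as raw data, and the rest continue downstream.

```python
# Hypothetical sketch of a source-side reject link: rows that do not
# match the column definitions (here: int, str, int) are rejected.

def parse_row(fields, types):
    return [t(f) for t, f in zip(types, fields)]

def read_with_reject(lines, types=(int, str, int)):
    output, rejects = [], []
    for line in lines:
        fields = line.split(",")
        try:
            if len(fields) != len(types):
                raise ValueError("wrong column count")
            output.append(parse_row(fields, types))
        except ValueError:
            rejects.append(line)     # sent down the reject link as raw data
    return output, rejects

good, bad = read_with_reject(["1,Smith,10", "oops", "2,Jones,xx"])
# good == [[1, 'Smith', 10]]; bad == ['oops', '2,Jones,xx']
```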

Page 148: Ascential DS 324PX - V70 - Student Reference


File Set Stage

Can read or write file sets

Files suffixed by .fs

A file set consists of:
1. A descriptor file, containing the locations of the raw data files plus the metadata
2. The individual raw data files

Can be processed in parallel

The number of raw data files depends on the configuration file -- more on configuration files later.

Page 149: Ascential DS 324PX - V70 - Student Reference


File Set Stage Example

Descriptor file

The descriptor file shows both the records' metadata and the files' locations. The locations are determined by the configuration file.

Page 150: Ascential DS 324PX - V70 - Student Reference


File Set Usage

Why use a file set?
– 2 GB limit on some file systems
– Need to distribute data among nodes to prevent overruns
– If used in parallel, runs faster than a sequential file

File sets, while yielding faster access than simple text files, are not in the parallel extender internal format.

Page 151: Ascential DS 324PX - V70 - Student Reference


Lookup File Set Stage

Can create file sets

Usually used in conjunction with Lookup stages

The lookup file set is similar to the file set but also contains information about the key columns. These keys will be used later in lookups.

Page 152: Ascential DS 324PX - V70 - Student Reference


Lookup File Set > Properties

Key column specified; the key column is recorded in the descriptor file.

Page 153: Ascential DS 324PX - V70 - Student Reference


Data Set

Operating system (Framework) file

Suffixed by .ds

Referred to by a control file

Managed by Data Set Management utility from GUI (Manager, Designer, Director)

Represents persistent data

Key to good performance in a set of linked jobs

Data sets represent persistent data maintained in the internal format.

Page 154: Ascential DS 324PX - V70 - Student Reference


Persistent Datasets

Accessed from/to disk with DataSet Stage.

Two parts:
– Descriptor file: contains metadata and the data location, but NOT the data itself
– Data file(s): contain the data; multiple Unix files (one per node), accessible in parallel

Example: the descriptor input.ds points to data files under node1:/local/disk1/… and node2:/local/disk2/…, with the schema: record (partno: int32; description: string;)

The descriptor file has a user-specified name and contains the table definition (the “unformatted core” schema). The Data Set stage icon shown here is used to access persistent datasets.

The descriptor file, e.g. “input.ds”, contains:
• the paths of the data files
• the metadata: an unformatted table definition, with no formats (unformatted schema)
• the configuration file used to store the data

The data file(s) contain the data itself, with system-generated long file names to avoid naming conflicts.

Page 155: Ascential DS 324PX - V70 - Student Reference


Quiz!

• True or False? Everything that has been data-partitioned must be collected in the same job.

For example, if we have four nodes corresponding to four regions, we'll have four reports. No need to recollect if one does not need inter-region correlations.

Page 156: Ascential DS 324PX - V70 - Student Reference


Data Set Stage

Is the data partitioned?

In both cases the answer is Yes.

Page 157: Ascential DS 324PX - V70 - Student Reference


Engine Data Translation

Occurs on import– From sequential files or file sets– From RDBMS

Occurs on export– From datasets to file sets or sequential files– From datasets to RDBMS

The engine is most efficient when processing internally formatted records (i.e. data contained in datasets).

Page 158: Ascential DS 324PX - V70 - Student Reference


Managing DataSets

GUI (Manager, Designer, Director): Tools > Data Set Management

Alternative methods:
– orchadmin: Unix command-line utility; lists records; removes data sets (will remove all components)
– dsrecords: lists the number of records in a dataset

Both dsrecords and orchadmin are Unix command-line utilities.

The DataStage Designer GUI provides a mechanism to view and manage data sets.

Page 159: Ascential DS 324PX - V70 - Student Reference


Data Set Management

Display data

Schema

This screen (Data Set Management) is available from Manager, Designer, and Director.

Page 160: Ascential DS 324PX - V70 - Student Reference


Data Set Management From Unix

Alternative methods of managing file sets and data sets:

– dsrecords: gives a record count; Unix command-line utility
  $ dsrecords ds_name
  e.g. $ dsrecords myDS.ds
  156999 records

– orchadmin: manages PX persistent data sets; Unix command-line utility
  e.g. $ orchadmin rm myDataSet.ds

Page 161: Ascential DS 324PX - V70 - Student Reference


Exercise

Complete exercises 2-1, 2-2, 2-3, and 2-4.

Page 162: Ascential DS 324PX - V70 - Student Reference


Blank

Page 163: Ascential DS 324PX - V70 - Student Reference


Module 3

Standards and Techniques

Page 164: Ascential DS 324PX - V70 - Student Reference


Objectives

Establish standard techniques for DSPX development

Will cover:
– Job documentation
– Naming conventions for jobs, links, and stages
– Iterative job design
– Useful stages for job development
– Using configuration files for development
– Using environment variables
– Job parameters

Page 165: Ascential DS 324PX - V70 - Student Reference


Job Presentation

Document using the annotation stage

Page 166: Ascential DS 324PX - V70 - Student Reference


Job Properties Documentation

Description shows in DS Manager and MetaStage

Organize jobs into categories

Page 167: Ascential DS 324PX - V70 - Student Reference


Naming conventions

Stages are named after:
– The data they access
– The function they perform
– DO NOT leave defaulted stage names like Sequential_File_0

Links are named for the data they carry:
– DO NOT leave defaulted link names like DSLink3

Page 168: Ascential DS 324PX - V70 - Student Reference


Stage and Link Names

Stages and links renamed for the data they handle

Page 169: Ascential DS 324PX - V70 - Student Reference


Create Reusable Job Components

Use parallel extender shared containers when feasible

Container

Page 170: Ascential DS 324PX - V70 - Student Reference


Use Iterative Job Design

Use copy or peek stage as stub

Test job in phases – small first, then increasing in complexity

Use Peek stage to examine records

Page 171: Ascential DS 324PX - V70 - Student Reference


Copy or Peek Stage Stub

Copy stage

Page 172: Ascential DS 324PX - V70 - Student Reference


Transformer StageTechniques

Suggestions:
– Always include a reject link.
– Always test for null values before using a column in a function.
– Try to use RCP and only map columns that have a derivation other than a copy. More on RCP later.
– Be aware of column and stage-variable data types; often the user does not pay attention to the stage variable's type.
– Avoid type conversions; try to maintain the data type as imported.

Page 173: Ascential DS 324PX - V70 - Student Reference


The Copy Stage

With 1 link in and 1 link out, the Copy stage is the ultimate “no-op” (place-holder). The following can be inserted on its links:
– input link (Partitioning tab): partitioners, Sort, Remove Duplicates
– output link (Mapping page): Rename, Drop column

It can sometimes replace the Transformer.

Page 174: Ascential DS 324PX - V70 - Student Reference


Developing Jobs

1. Keep it simple.
• Jobs with many stages are hard to debug and maintain.

2. Start small and build to the final solution.
• Use view data, copy, and peek.
• Start from the source and work outward.
• Develop with a 1-node configuration file.

3. Solve the business problem before the performance problem.
• Don't worry too much about partitioning until the sequential flow works as expected.

4. If you have to write to disk, use a persistent data set.

Page 175: Ascential DS 324PX - V70 - Student Reference


Final Result

Page 176: Ascential DS 324PX - V70 - Student Reference


Good Things to Have in each Job

Use job parameters

Some helpful environment variables to add to job parameters:
– $APT_DUMP_SCORE: reports the score (OSH) to the message log
– $APT_CONFIG_FILE: establishes runtime parameters for the PX engine, e.g. the degree of parallelization

Page 177: Ascential DS 324PX - V70 - Student Reference


Setting Job Parameters

Click to add environment variables

Page 178: Ascential DS 324PX - V70 - Student Reference


DUMP SCORE Output

Setting APT_DUMP_SCORE yields a report in the job log (double-click the entry to view it): the mapping of nodes to partitions, and the partitioners and collectors that were inserted.

Page 179: Ascential DS 324PX - V70 - Student Reference


Use Multiple Configuration Files

Make a set for 1X, 2X,….

Use different ones for test versus production

Include as a parameter in each job

Page 180: Ascential DS 324PX - V70 - Student Reference


Exercise

Complete exercise 3-1

Page 181: Ascential DS 324PX - V70 - Student Reference


Module 4

DBMS Access

Page 182: Ascential DS 324PX - V70 - Student Reference


Objectives

Understand how DSPX reads and writes records to an RDBMS

Understand how to handle nulls on DBMS lookup

Utilize this knowledge to:
– Read and write database tables
– Use database tables to look up data
– Use null-handling options to clean data

Page 183: Ascential DS 324PX - V70 - Student Reference


Parallel Database Connectivity

Traditional client-server: only the RDBMS runs in parallel; each application (client) has only one connection; suitable only for small data volumes.

Parallel Extender: the parallel server runs the applications (e.g. sort, load); the application has parallel connections to the parallel RDBMS; suitable for large data volumes; higher levels of integration possible.

Page 184: Ascential DS 324PX - V70 - Student Reference


RDBMS Access: Supported Databases

Parallel Extender provides high performance / scalable interfaces for:

DB2

Informix

Oracle

Teradata

Users must be granted specific privileges, depending on RDBMS.

Page 185: Ascential DS 324PX - V70 - Student Reference


RDBMS access is relatively easy because Orchestrate extracts the schema definition for the imported data set. Little or no work is required from the user.

Automatically convert RDBMS table layouts to/from Parallel Extender Table Definitions

RDBMS nulls converted to/from nullable field values

Support for standard SQL syntax for specifying:– field list for SELECT statement– filter for WHERE clause– open command, close command

Can write an explicit SQL query to access the RDBMS; PX supplies additional information in the SQL query.

RDBMS Access: Supported Databases

Page 186: Ascential DS 324PX - V70 - Student Reference


RDBMS Stages

DB2/UDB Enterprise

Informix Enterprise

Oracle Enterprise

Teradata Enterprise

ODBC

Page 187: Ascential DS 324PX - V70 - Student Reference


RDBMS Usage

As a source:
– Extract data from a table (stream link)
– Extract as table, generated SQL, or user-defined SQL; user-defined SQL can perform joins and access views
– Lookup (reference link): a normal lookup is memory-based (all table data read into memory); can perform one lookup at a time in the DBMS (sparse option); continue/drop/fail options

As a target:
– Inserts
– Upserts (inserts and updates)
– Loader

Page 188: Ascential DS 324PX - V70 - Student Reference


RDBMS Source – Stream Link

Stream link

Page 189: Ascential DS 324PX - V70 - Student Reference


DBMS Source - User-defined SQL

The columns in the SQL statement must match the metadata in the Columns tab.

Page 190: Ascential DS 324PX - V70 - Student Reference


Exercise

User-defined SQL– Exercise 4-1

Page 191: Ascential DS 324PX - V70 - Student Reference


DBMS Source – Reference Link

Reject link

Page 192: Ascential DS 324PX - V70 - Student Reference


Lookup Reject Link

“Output” option automatically creates the reject link

All columns from the input link will be placed on the reject link. Therefore, no Columns tab is available for the reject link.

Page 193: Ascential DS 324PX - V70 - Student Reference


Null Handling

You must handle the null condition if a lookup record is not found and the “continue” option is chosen.

Can be done in a transformer stage

Page 194: Ascential DS 324PX - V70 - Student Reference


Lookup Stage Mapping

Link name

The mapping tab will show all columns from the input link and the reference link (less the column used for key lookup).

Page 195: Ascential DS 324PX - V70 - Student Reference


Lookup Stage Properties

Reference link

You must have the same column name in the input and reference links. You will get the results of the lookup in the output column.

If the lookup results in a non-match and the action was set to continue, the output column will be null.
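The lookup-failure actions can be sketched in Python (a hypothetical illustration of the semantics, not DataStage code): continue emits the row with a null output column, drop discards it, and fail aborts the job.

```python
# Hypothetical sketch of lookup-failure handling: continue, drop, fail.
lookup_table = {"NY": "New York", "MA": "Massachusetts"}   # reference data in memory

def do_lookup(keys, action="continue"):
    out = []
    for key in keys:
        if key in lookup_table:
            out.append((key, lookup_table[key]))
        elif action == "continue":
            out.append((key, None))     # non-match: output column is null
        elif action == "drop":
            continue                    # non-match: row silently dropped
        elif action == "fail":
            raise RuntimeError(f"lookup failed for key {key!r}")
    return out

print(do_lookup(["NY", "TX"], "continue"))  # [('NY', 'New York'), ('TX', None)]
print(do_lookup(["NY", "TX"], "drop"))      # [('NY', 'New York')]
```

This is why, with the continue option, a downstream Transformer must test the output column for null before using it.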

Page 196: Ascential DS 324PX - V70 - Student Reference


DBMS as a Target

Page 197: Ascential DS 324PX - V70 - Student Reference


DBMS As Target

Write methods:
– Delete
– Load
– Upsert
– Write (DB2)

Write mode for the load method:
– Truncate
– Create
– Replace
– Append

Page 198: Ascential DS 324PX - V70 - Student Reference


Target Properties

Upsert mode determines options

Generated code can be copied

Page 199: Ascential DS 324PX - V70 - Student Reference


Checking for Nulls

Use a Transformer stage to test for fields with null values (use the IsNull function).

In the Transformer, you can reject the row or load a default value.

Page 200: Ascential DS 324PX - V70 - Student Reference


Exercise

Complete exercise 4-2

Page 201: Ascential DS 324PX - V70 - Student Reference


Module 5

Platform Architecture

Page 202: Ascential DS 324PX - V70 - Student Reference


Objectives

Understand how parallel extender Framework processes data

You will be able to:
– Read and understand OSH
– Perform troubleshooting

Page 203: Ascential DS 324PX - V70 - Student Reference


Concepts

The PX Platform

OSH (generated by DataStage Parallel Canvas, and run by DataStage Director)

Conductor, section leaders, players

Configuration files (only one active at a time, describes H/W)

Schemas/tables

Schema propagation/RCP

Buildop, Wrapper

Datasets (data in Framework's internal representation)

Page 204: Ascential DS 324PX - V70 - Student Reference


Output Data Set schema: prov_num:int16; member_num:int8; custid:int32;

Input Data Set schema: prov_num:int16; member_num:int8; custid:int32;

PX Stages Involve a Series of Processing Steps

A PX stage consists of an input interface, a partitioner, business logic, and an output interface.

• A piece of application logic running against individual records
• Parallel or sequential
• Three sources:
– Ascential-supplied
– Commercial tools/applications
– Custom/existing programs

DS-PX Program Elements

Each Orchestrate component has input and output interface schemas, a partitioner, and business logic. Interface schemas define the names and data types of the required fields of the component's input and output data sets. The component's input interface schema requires that an input data set have fields with names and data types exactly compatible with those specified by the interface schema for the data set to be accepted by the component. A component ignores any extra fields in a data set, which allows the component to be used with any data set that has at least the fields of the component's input interface schema. This property makes it possible to add and delete fields from a relational database table or from the Orchestrate data set without having to rewrite code inside the component. In the example shown here, the component has an interface schema that requires three fields with the names and data types shown. In this example, the output schema for the component is the same as the input schema; this does not always have to be the case.

The partitioner is key to Orchestrate’s ability to deliver parallelism and unlimited scalability. We’ll discuss exactly how the partitioners work in a few slides, but here it’s important to point out that partitioners are an integral part of Orchestrate components.

Page 205: Ascential DS 324PX - V70 - Student Reference


• PX delivers parallelism in two ways:
– Pipeline
– Partition

• Block buffering between components:
– Eliminates the need for program load balancing
– Maintains orderly data flow

Pipeline parallelism: a producer feeds records to a consumer as they are processed. Partition parallelism: records are split across multiple instances. Dual parallelism eliminates bottlenecks!

DS-PX Program ElementsStage Execution

There is one more point to be made about the Orchestrate execution model. Orchestrate achieves parallelism in two ways. We have already talked about partitioning the records and running multiple instances of each component to speed up program execution. In addition to this partition parallelism, Orchestrate also executes pipeline parallelism. As shown in the picture on the left, as the Orchestrate program is executing, a producer component feeds records to a consumer component without first writing the records to disk. Orchestrate pipelines the records forward in the flow as they are being processed by each component. This means that the consumer component is processing records fed to it by the producer component before the producer has finished processing all of the records. Orchestrate provides block buffering between components so that producers cannot produce records faster than consumers can consume them. This pipelining of records eliminates the need to store intermediate results to disk, which can provide significant performance advantages, particularly when operating against large volumes of data.
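The block-buffering behavior described above can be sketched with a bounded queue (a hypothetical Python illustration, not the Orchestrate implementation): the producer blocks whenever the buffer is full, so it can never run unboundedly ahead of the consumer, and the consumer starts processing before the producer has finished.

```python
# Hypothetical sketch of pipeline parallelism with block buffering:
# a bounded queue stops the producer from outrunning the consumer.
import queue
import threading

buf = queue.Queue(maxsize=4)           # the "block buffer" between operators
results = []

def producer():
    for rec in range(10):
        buf.put(rec)                   # blocks while the buffer is full
    buf.put(None)                      # end-of-data marker

def consumer():
    while True:
        rec = buf.get()
        if rec is None:
            break
        results.append(rec * 2)        # downstream work starts before the
                                       # producer has emitted all records

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
print(results)                         # [0, 2, 4, ..., 18]
```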

Page 206: Ascential DS 324PX - V70 - Student Reference


Stages Control Partition Parallelism

Execution mode (sequential/parallel) is controlled by the stage:
– default = parallel for most Ascential-supplied stages
– the user can override the default mode
– a parallel stage inserts the default partitioner (Auto) on its input links
– a sequential stage inserts the default collector (Auto) on its input links
– the user can override the default: the execution mode (parallel/sequential) of the stage on the Advanced tab, and the choice of partitioner/collector on the Input > Partitioning tab

Page 207: Ascential DS 324PX - V70 - Student Reference


How Parallel Is It?

The degree of parallelism is determined by the configuration file:
– the total number of logical nodes in the default pool, or a subset if using “constraints”

Constraints are assigned to specific pools as defined in the configuration file and can be referenced in the stage.

Page 208: Ascential DS 324PX - V70 - Student Reference


OSH

The DataStage PX GUI generates OSH scripts:
– The ability to view OSH is turned on in the Administrator
– OSH can be viewed in the Designer using job properties

The Framework executes OSH.

What is OSH?
– Orchestrate shell
– Has a Unix command-line interface

Page 209: Ascential DS 324PX - V70 - Student Reference


OSH Script

An osh script is a quoted string which specifies the operators and connections of a single Orchestrate step. In its simplest form, it is:

osh “op < in.ds > out.ds”

Where:
– op is an Orchestrate operator
– in.ds is the input data set
– out.ds is the output data set

Page 210: Ascential DS 324PX - V70 - Student Reference


OSH Operators

Operator is an instance of a C++ class inheriting from APT_Operator

Developers can create new operators

Examples of existing operators:
– Import
– Export
– RemoveDups

Operators are the basic functional units of an Orchestrate application. Operators read records from input data sets, perform actions on the input records, and write results to output data sets. An operator may perform an action as simple as copying records from an input data set to an output data set without modification. Alternatively, an operator may modify a record by adding, removing, or modifying fields during execution.

Page 211: Ascential DS 324PX - V70 - Student Reference


Enable Visible OSH in Administrator

Will be enabled for all projects

Page 212: Ascential DS 324PX - V70 - Student Reference


View OSH in Designer

Schema

Operator

Page 213: Ascential DS 324PX - V70 - Student Reference


OSH Practice

Exercise 5-1

Page 214: Ascential DS 324PX - V70 - Student Reference


Orchestrate May Add Operators to Your Command

$ osh " echo 'Hello world!' [par] > outfile "

Let’s revisit the following OSH command:

3)collector

4)export(from datasetto Unix flat file)

2)partitioner

1)“wrapper”(turning aUnix command intoan DS/PX Operator)

The Framework silently inserts operators (steps 1,2,3,4)

Page 215: Ascential DS 324PX - V70 - Student Reference


• Step: the unit of an OSH program
– one OSH command = one step
– at the end of a step: synchronization, storage to disk

• Datasets: sets of rows processed by the Framework
– Orchestrate data sets: persistent (terminal) *.ds, and virtual (internal) *.v
– Also: flat “file sets” *.fs

• Schema: the data description (metadata) for datasets and links

Steps, with internal and terminal datasets and links, are described by schemas.

Elements of a Framework Program

Page 216: Ascential DS 324PX - V70 - Student Reference


Orchestrate Datasets

• Consist of partitioned data and a schema
• Can be persistent (*.ds) or virtual (*.v, a link)
• Overcome the 2 GB file limit

What you program (GUI or OSH):

$ osh “operator_A > x.ds“

What gets processed: Operator A runs on each node (Node 1 through Node 4), writing the data files of x.ds -- multiple files per partition, each file up to 2 GB (or larger).

Page 217: Ascential DS 324PX - V70 - Student Reference


Computing Architectures: Definition

Uniprocessor: a PC, workstation, or single-processor server; one CPU with dedicated disk.

SMP system (symmetric multiprocessor): shared memory and shared disk; IBM, Sun, HP, Compaq; 2 to 64 processors; the majority of installations.

Clusters and MPP systems: shared nothing; 2 to hundreds of processors; MPP: IBM and NCR Teradata; each node is a uniprocessor or SMP with its own CPU, memory, and disk.

Page 218: Ascential DS 324PX - V70 - Student Reference


Job Execution: Orchestrate

Diagram: a conductor node (C) communicates with processing nodes, each running a section leader (SL) and player processes (P).

• Conductor: the initial DS/PX process
– Step composer
– Creates section leader processes (one per node)
– Consolidates messages and outputs them
– Manages orderly shutdown

• Section leader:
– Forks player processes (one per stage)
– Manages up/down communication

• Players:
– The actual processes associated with stages
– Combined players: one process only
– Send stderr to the SL
– Establish connections to other players for data flow
– Clean up upon completion

• Communication:
– SMP: shared memory
– MPP: TCP

Page 219: Ascential DS 324PX - V70 - Student Reference


Working with Configuration Files

You can easily switch between configuration files:
– a '1-node' file for sequential execution and lighter reports -- handy for testing
– a 'MedN-nodes' file, aiming at a mix of pipeline and data-partitioned parallelism
– a 'BigN-nodes' file, aiming at full data-partitioned parallelism

Only one file is active while a step is running. The Framework queries (first) the environment variable:

$APT_CONFIG_FILE

The number of nodes declared in the configuration file need not match the number of CPUs. The same configuration file can be used on development and target machines.

Page 220: Ascential DS 324PX - V70 - Student Reference


Scheduling: Nodes, Processes, and CPUs

DS/PX does not:
– know how many CPUs are available
– schedule

Who does what?
– DS/PX creates (Nodes * Ops) Unix processes
– The O/S schedules these processes on the CPUs

Where:
Nodes = # logical nodes declared in the config file
Ops = # operators (approx. # blue boxes in the visual output)
Processes = # Unix processes
CPUs = # available CPUs

Who knows what?
– The user knows Nodes (Y) but not CPUs (N)
– Orchestrate knows Nodes and Ops, and creates Nodes * Ops processes, but not CPUs (N)
– The O/S handles the processes and the CPUs (Y)

Page 221: Ascential DS 324PX - V70 - Student Reference


{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}

Configuring DSPX – Node Pools

Page 222: Ascential DS 324PX - V70 - Student Reference

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}

Configuring DSPX – Disk Pools

Page 223: Ascential DS 324PX - V70 - Student Reference

[Diagram: records flowing from node 1 and node 2 through a partitioner]

Parallel to parallel flow may incur reshuffling:
records may jump between nodes.

Re-Partitioning

Page 224: Ascential DS 324PX - V70 - Student Reference

Partitioner with parallel import

When a partitioner receives:

• sequential input (1 partition), it creates N partitions

• parallel input (N partitions), it outputs N partitions*, which may result in re-partitioning

* Assuming no “constraints”

Re-Partitioning X-ray

Page 225: Ascential DS 324PX - V70 - Student Reference

If Stage 2 runs in parallel, DS/PX silently inserts a partitioner upstream of it.

If Stage 1 also runs in parallel, re-partitioning occurs.

[Diagram: partitioner inserted between Stage 1 and Stage 2, feeding partition 1 and partition 2]

In most cases, automatic re-partitioning is benign (no reshuffling), preserving the same partitioning as upstream.

Re-partitioning can be forced to be benign, using either:
– the "Same" partitioning method
– the "Preserve partitioning" flag

Automatic Re-Partitioning

Page 226: Ascential DS 324PX - V70 - Student Reference

Partitioning Methods

Auto

Hash

Entire

Range

Range Map

Page 227: Ascential DS 324PX - V70 - Student Reference

• Collectors combine partitions of a dataset into a single input stream to a sequential Stage

data partitions

collector

sequential Stage

...

– Collectors do NOT synchronize data

Collectors

Page 228: Ascential DS 324PX - V70 - Student Reference

Partitioning and Repartitioning Are Visible On Job Design

Partitioners and collectors have no stages or icons of their own. They live on input links of stages running in parallel (respectively, sequentially). Link markings indicate their presence.

S -----------------> S   (no marking)
S ----(fan out)----> P   (partitioner)
P ----(fan in)-----> S   (collector)
P ----(box)--------> P   (no reshuffling: partitioner using "SAME" method)
P ----(bow tie)----> P   (reshuffling: partitioner using another method)

Collectors are inverse partitioners: they recollect rows from partitions into a single input stream to a sequential stage.

Page 229: Ascential DS 324PX - V70 - Student Reference

Partitioning and Collecting Icons

Partitioner Collector

Page 230: Ascential DS 324PX - V70 - Student Reference

Setting a Node Constraint in the GUI

Page 231: Ascential DS 324PX - V70 - Student Reference

Reading Messages in Director

Set APT_DUMP_SCORE to true

Can be specified as job parameter

Messages sent to Director log

If set, parallel job will produce a report showing the operators, processes, and datasets in the running job

Page 232: Ascential DS 324PX - V70 - Student Reference

Messages With APT_DUMP_SCORE = True

Page 233: Ascential DS 324PX - V70 - Student Reference

Exercise

Complete exercise 5-2

Page 234: Ascential DS 324PX - V70 - Student Reference

Blank

Page 235: Ascential DS 324PX - V70 - Student Reference

Module 6

Transforming Data

Page 236: Ascential DS 324PX - V70 - Student Reference

Module Objectives

Understand ways DataStage allows you to transform data

Use this understanding to:
– Create column derivations using user-defined code or system functions
– Filter records based on business criteria
– Control data flow based on data conditions

Page 237: Ascential DS 324PX - V70 - Student Reference

Transformed Data

Transformed data:
– An outgoing column is a derivation that may, or may not, include incoming fields or parts of incoming fields
– May be comprised of system variables
– Frequently uses functions performed on something (e.g., incoming columns)
– Functions are divided into categories, e.g.:
  Date and time
  Mathematical
  Logical
  Null handling
  More

Page 238: Ascential DS 324PX - V70 - Student Reference

Stages Review

Stages that can transform data
– Transformer
  Parallel
  Basic (from Parallel palette)
– Aggregator (discussed in a later module)

Sample stages that do not transform data
– Sequential
– FileSet
– DataSet
– DBMS

Page 239: Ascential DS 324PX - V70 - Student Reference

Transformer Stage Functions

Control data flow

Create derivations

Page 240: Ascential DS 324PX - V70 - Student Reference

Flow Control

Separate record flow down links based on data condition – specified in Transformer stage constraints

Transformer stage can filter records

Other stages can filter records but do not exhibit advanced flow control– Sequential– Lookup– Filter

Page 241: Ascential DS 324PX - V70 - Student Reference

Rejecting Data

Reject option on Sequential stage
– Data does not agree with the meta data
– Output consists of one column with binary data type

Reject links (from the Lookup stage) result from the drop option of the property "If Not Found"
– Lookup "failed"
– All columns on reject link (no column mapping option)

Reject constraints are controlled from the constraint editor of the transformer
– Can control column mapping
– Use the "Other/Log" checkbox

Page 242: Ascential DS 324PX - V70 - Student Reference

Rejecting Data Example

“If Not Found” property

Constraint – Other/Log option

Property Reject Mode = Output

Page 243: Ascential DS 324PX - V70 - Student Reference

Transformer Stage Properties

Link naming conventions are important because they identify appropriate links in the stage properties screen shown above.

Four quadrants:
– Incoming data link (one only)
– Outgoing links (can have multiple)
– Meta data for all incoming links
– Meta data for all outgoing links (may have multiple tabs if there are multiple outgoing links)

Note the constraints bar – if you double-click on it you will get the screen for defining constraints for all outgoing links.

Page 244: Ascential DS 324PX - V70 - Student Reference

Transformer Stage Variables

First of transformer stage entities to execute

Execute in order from top to bottom
– Can write a program by using one stage variable to point to the results of a previous stage variable

Multi-purpose
– Counters
– Hold values from previous rows to make comparisons
– Hold derivations to be used in multiple field derivations
– Can be used to control execution of constraints

Page 245: Ascential DS 324PX - V70 - Student Reference

Stage Variables

Show/Hide button

Page 246: Ascential DS 324PX - V70 - Student Reference

Transforming Data

Derivations
– Using expressions
– Using functions (e.g., date/time)

Transformer Stage Issues
– Sometimes require sorting before the transformer stage – e.g., using a stage variable as an accumulator and needing to break on a change of column value

Checking for nulls

Page 247: Ascential DS 324PX - V70 - Student Reference

Checking for Nulls

Nulls can get introduced into the dataflow because of failed lookups and the way in which you chose to handle this condition

Can be handled in constraints, derivations, stage variables, or a combination of these

If you perform a lookup from a lookup stage and choose the continue option for a failed lookup, you have the possibility of nulls entering your data flow.

Page 248: Ascential DS 324PX - V70 - Student Reference

Nullability

You can set the value to substitute for null; e.g., if the value of a column is null, put "NULL" in the outgoing column.

Source Field    Destination Field   Result
nullable        nullable            Source value or null propagates.
nullable        not_nullable        WARNING messages in log. If the source value is null, a fatal error occurs. Must handle in transformer.
not_nullable    nullable            Source value propagates; destination value is never null.
not_nullable    not_nullable        Source value propagates to destination.
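The propagation rules above can be sketched as a small model (illustrative Python, not PX code; `None` stands in for null):

```python
def propagate(value, src_nullable, dst_nullable):
    """Model of the nullability table: return the value that reaches
    the destination, or raise to mimic the fatal runtime error."""
    if value is None and not src_nullable:
        raise ValueError("a not_nullable source cannot hold null")
    if value is None and not dst_nullable:
        # nullable source -> not_nullable destination: fatal unless the
        # transformer substitutes a value first (e.g. the "NULL" string)
        raise ValueError("null reaching a not_nullable destination is fatal")
    return value  # source value (or null) propagates

print(propagate("abc", True, True))   # abc
print(propagate(None, True, True))    # None
```

Substituting a value in a derivation before the null reaches a not_nullable column is what "must handle in transformer" means.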

Page 249: Ascential DS 324PX - V70 - Student Reference

Transformer Stage- Handling Rejects

1. Constraint Rejects
– All expressions are false and the reject row is checked

2. Expression Error Rejects
– Improperly handled null

Page 250: Ascential DS 324PX - V70 - Student Reference

Transformer: Execution Order

• Derivations in stage variables are executed first

• Constraints are executed before derivations

• Column derivations in earlier links are executed before later links

• Derivations in higher columns are executed before lower columns

Page 251: Ascential DS 324PX - V70 - Student Reference

Two Transformers for the Parallel Palette

All > Processing > Transformer
– Is the non-Universe transformer
– Has a specific set of functions
– No DS routines available

Parallel > Processing > Basic Transformer
– Makes server-style transforms available on the parallel palette
– Can use DS routines
– No need for a shared container to get Universe functionality on the parallel palette

• Program in Basic for both transformers

Page 252: Ascential DS 324PX - V70 - Student Reference

Transformer Functions From Derivation Editor

Date & Time

Logical

Mathematical

Null Handling

Number

Raw

String

Type Conversion

Utility

Page 253: Ascential DS 324PX - V70 - Student Reference

Timestamps and Dates

Date & Time

Also some in Type Conversion

Page 254: Ascential DS 324PX - V70 - Student Reference

Exercise

Complete exercises 6-1, 6-2, and 6-3

Page 255: Ascential DS 324PX - V70 - Student Reference

Module 7

Sorting Data

Page 256: Ascential DS 324PX - V70 - Student Reference

Objectives

Understand DataStage PX sorting options

Use this understanding to create sorted lists of data to enable functionality within a transformer stage

Page 257: Ascential DS 324PX - V70 - Student Reference

Sorting Data

Important because
– Transformer may be using stage variables for accumulators or control breaks, and order is important
– Other stages may run faster, e.g., Aggregator
– Facilitates the RemoveDups stage, where order is important
– Job has partitioning requirements

Can be performed
– As an option within stages (use the Input > Partitioning tab and set partitioning to anything other than Auto)
– As a separate stage (more complex sorts)

Page 258: Ascential DS 324PX - V70 - Student Reference

Sorting Alternatives

• Alternative representation of the same flow:

One of the nation's largest direct marketing outfits has been using this simple program in DS-PX (and its previous instantiations) for years. Householding yields enormous savings by avoiding mailing the same material (in particular, expensive catalogs) to the same household.

Page 259: Ascential DS 324PX - V70 - Student Reference

Sort Option on Stage Link

Stable sort will not rearrange records that are already in a properly sorted data set. If set to false, no prior ordering of records is guaranteed to be preserved by the sorting operation.
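Stability can be illustrated with any stable sort (a toy example in Python, whose built-in sort is stable; the data is made up):

```python
# Rows in their prior order; sort by the numeric key only.
rows = [("A", 2), ("B", 1), ("C", 2), ("D", 1)]

# A stable sort keeps B before D and A before C, because those pairs
# tie on the sort key and their prior order is preserved.
stable = sorted(rows, key=lambda r: r[1])
print(stable)  # [('B', 1), ('D', 1), ('A', 2), ('C', 2)]
```

An unstable sort would be free to swap B/D or A/C, which matters whenever downstream logic relies on the prior ordering.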

Page 260: Ascential DS 324PX - V70 - Student Reference

Sort Stage

Page 261: Ascential DS 324PX - V70 - Student Reference

Sort Utility

DataStage – the default

SyncSort

UNIX

Page 262: Ascential DS 324PX - V70 - Student Reference

Sort Stage - Outputs

Specifies how the output is derived

Page 263: Ascential DS 324PX - V70 - Student Reference

Sort Specification Options

Input Link Property
– Limited functionality
– Max memory/partition is 20 MB, then spills to scratch

Sort Stage
– Tunable to use more memory before spilling to scratch

Note: Spread I/O by adding more scratch file systems to each node of the APT_CONFIG_FILE

Page 264: Ascential DS 324PX - V70 - Student Reference

Removing Duplicates

Can be done by Sort
– Use the unique option

OR

Remove Duplicates stage
– Has more sophisticated ways to remove duplicates
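On key-sorted input, duplicate removal reduces to keeping one row per key change. A toy sketch of the idea (illustrative Python, not the stage's actual implementation):

```python
def remove_dups(sorted_rows, key):
    """Keep the first row of each group of equal keys.
    The input must already be sorted on the key, as the stage requires."""
    out, prev = [], object()   # sentinel that equals no real key
    for row in sorted_rows:
        k = key(row)
        if k != prev:          # key change: start of a new group
            out.append(row)
            prev = k
    return out

rows = [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e")]
print(remove_dups(rows, key=lambda r: r[0]))  # [(1, 'a'), (2, 'c'), (3, 'd')]
```

This is why sorting (and partitioning on the key) upstream matters: duplicates that land in different partitions or arrive out of order are not adjacent and would survive.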

Page 265: Ascential DS 324PX - V70 - Student Reference

Exercise

Complete exercise 7-1

Page 266: Ascential DS 324PX - V70 - Student Reference

Blank

Page 267: Ascential DS 324PX - V70 - Student Reference

Module 8

Combining Data

Page 268: Ascential DS 324PX - V70 - Student Reference

Objectives

Understand how DataStage can combine data using the Join, Lookup, Merge, and Aggregator stages

Use this understanding to create jobs that will– Combine data from separate input streams– Aggregate data to form summary totals

Page 269: Ascential DS 324PX - V70 - Student Reference

Combining Data

There are two ways to combine data:

– Horizontally: several input links; one output link (plus optional rejects) made of columns from different input links. E.g.:
  Joins
  Lookup
  Merge

– Vertically: one input link; output with columns combining values from all input rows. E.g.:
  Aggregator

Page 270: Ascential DS 324PX - V70 - Student Reference

Recall the Join, Lookup & Merge Stages

These "three Stages" combine two or more input links according to values of user-designated "key" column(s).

They differ mainly in:– Memory usage– Treatment of rows with unmatched key values– Input requirements (sorted, de-duplicated)

Page 271: Ascential DS 324PX - V70 - Student Reference

Joins - Lookup - Merge: Not All Links are Created Equal!

                                   Joins   Lookup        Merge
Primary Input (port 0):            Left    Source        Master
Secondary Input(s) (ports 1,…):    Right   LU Table(s)   Update(s)

• Parallel Extender distinguishes between:
– The Primary Input (Framework port 0)
– Secondary – in some cases "Reference" (other ports)

• Naming convention:

Tip: Check "Input Ordering" tab to make sure intended Primary is listed first

The Framework concept of Port # is translated in the GUI by Primary/Reference

Page 272: Ascential DS 324PX - V70 - Student Reference

Join Stage Editor

One of four variants:
– Inner
– Left Outer
– Right Outer
– Full Outer

Several key columns allowed

Link Order immaterial for Inner and Full Outer Joins (but VERY important for Left/Right Outer and Lookup and Merge)

Page 273: Ascential DS 324PX - V70 - Student Reference

1. The Join Stage

Four types:
• Inner
• Left Outer
• Right Outer
• Full Outer

2 sorted input links, 1 output link
– "left" on primary input, "right" on secondary input
– Pre-sort makes joins "lightweight": few rows need to be in RAM

Follows the RDBMS-style relational model: the operations Join and Load in an RDBMS commute.

Cross-products in case of duplicates; matching entries are reusable.
No fail/reject/drop option for missed matches.
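The cross-product behavior on duplicate keys can be sketched with a toy inner join (illustrative Python, not the stage's implementation; the data is made up):

```python
from collections import defaultdict

def inner_join(left, right, key=lambda r: r[0]):
    """Toy inner join: rows with equal keys pair up; duplicate keys on
    both sides yield a cross-product, and unmatched rows are silently
    dropped (there is no fail/reject/drop option)."""
    by_key = defaultdict(list)
    for r in right:
        by_key[key(r)].append(r)
    return [(l, r) for l in left for r in by_key[key(l)]]

left  = [(1, "L1"), (2, "L2a"), (2, "L2b")]
right = [(2, "R2a"), (2, "R2b"), (3, "R3")]
print(inner_join(left, right))
# key 2 appears twice on each side -> 2 x 2 = 4 output pairs;
# keys 1 and 3 have no match and vanish without any reject
```

The 2x2 blow-up on key 2 is the "x-product in case of duplicates" the slide warns about.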

Page 274: Ascential DS 324PX - V70 - Student Reference

2. The Lookup Stage

Combines:
– one source link with
– one or more duplicate-free table links

– No pre-sort necessary
– Allows multiple-key LUTs
– Flexible exception handling for source input rows with no match

[Diagram: Lookup stage with a source input (port 0), one or more lookup tables (LUTs, ports 1, 2, …), an output link, and a reject link]

Contrary to Join, Lookup and Merge deal with missing rows. Obviously, a missing row cannot be captured, since it is missing. The closest thing one can capture is the corresponding unmatched row.

Lookup can capture in a reject link unmatched rows from the primary input (Source). That is why it has only one reject link (there is only one primary). We'll see the reject option is exactly the opposite with the Merge stage.
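The "If Not Found" choices can be sketched with a toy lookup (illustrative Python; names and data are made up, not PX code):

```python
def lookup(source, table, on_miss="continue"):
    """Toy lookup: enrich each source row from an in-memory table.
    on_miss mimics the "If Not Found" options: 'continue' (emit the row
    with a null), 'drop' (discard it), or 'reject' (capture it)."""
    lut = {k: v for k, v in table}          # in-RAM lookup table
    out, rejects = [], []
    for k, payload in source:
        if k in lut:
            out.append((k, payload, lut[k]))
        elif on_miss == "continue":
            out.append((k, payload, None))  # a null enters the flow
        elif on_miss == "reject":
            rejects.append((k, payload))    # unmatched primary row
        # on_miss == 'drop': the unmatched row simply vanishes
    return out, rejects

src = [(1, "a"), (2, "b")]
tbl = [(1, "X")]
print(lookup(src, tbl, on_miss="reject"))  # ([(1, 'a', 'X')], [(2, 'b')])
```

Note how 'continue' is exactly the case that introduces nulls into the data flow, as the null-handling module warned.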

Page 275: Ascential DS 324PX - V70 - Student Reference

The Lookup Stage

Lookup Tables should be small enough to fit into physical memory (otherwise, performance hit due to paging)

– Space/time trade-off: pre-sort vs. in-RAM table

On an MPP you should partition the lookup tables using entire partitioning method, or partition them the same way you partition the source link

On an SMP, no physical duplication of a Lookup Table occurs

Page 276: Ascential DS 324PX - V70 - Student Reference

The Lookup Stage

Lookup File Set
– Like a persistent data set, only it contains metadata about the key
– Useful for staging lookup tables

RDBMS LOOKUP
– SPARSE: a select for each row; might become a performance bottleneck
– NORMAL: loads to an in-memory hash table first

Page 277: Ascential DS 324PX - V70 - Student Reference

3. The Merge Stage

Combines
– one sorted, duplicate-free master (primary) link with
– one or more sorted update (secondary) links
– Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup)

Follows the Master-Update model:
– A master row and one or more update rows are merged if they have the same value in the user-specified key column(s)
– A non-key column occurs in several inputs? The lowest input port number prevails (e.g., master over update; update values are ignored)
– Unmatched ("bad") master rows can be either kept or dropped
– Unmatched ("bad") update rows in an input link can be captured in a "reject" link
– Matched update rows are consumed
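The Master-Update rules above can be sketched as a toy merge (illustrative Python with a single update link; not the stage's implementation):

```python
def merge(master, updates, keep_unmatched_master=True):
    """Toy merge: master and update rows pair on key; matched update
    rows are consumed, unmatched updates are captured as rejects, and
    unmatched ("bad") master rows are kept or dropped."""
    upd = dict(updates)                       # duplicate-free update link
    out = []
    for k, mval in master:
        if k in upd:
            out.append((k, mval, upd.pop(k)))  # consume the matched update
        elif keep_unmatched_master:
            out.append((k, mval, None))        # kept "bad" master row
    rejects = sorted(upd.items())              # unmatched update rows
    return out, rejects

m = [(1, "m1"), (2, "m2")]
u = [(2, "u2"), (3, "u3")]
print(merge(m, u))  # ([(1, 'm1', None), (2, 'm2', 'u2')], [(3, 'u3')])
```

With real key-sorted links the pairing is done by a streaming scan rather than a dict, which is why merge stays lightweight; the dict here only keeps the toy short.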

Page 278: Ascential DS 324PX - V70 - Student Reference

The Merge Stage

– Allows composite keys
– Multiple update links
– Matched update rows are consumed
– Unmatched updates can be captured
– Lightweight
– Space/time tradeoff: pre-sorts vs. in-RAM table

[Diagram: Merge stage with one master input (port 0) and update inputs (ports 1, 2); one output link plus reject links (1, 2)]

Contrary to Lookup, Merge captures unmatched secondary (update) rows. Since there may be several update links, there may be several reject links.

Page 279: Ascential DS 324PX - V70 - Student Reference

Synopsis: Joins, Lookup, & Merge

                                  Joins                        Lookup                              Merge
Model                             RDBMS-style relational       Source - in-RAM LU Table            Master - Update(s)
Memory usage                      light                        heavy                               light
# and names of inputs             exactly 2: 1 left, 1 right   1 Source, N LU Tables               1 Master, N Update(s)
Mandatory input sort              both inputs                  no                                  all inputs
Duplicates in primary input       OK (x-product)               OK                                  Warning!
Duplicates in secondary input(s)  OK (x-product)               Warning!                            OK only when N = 1
Options on unmatched primary      NONE                         [fail] | continue | drop | reject   [keep] | drop
Options on unmatched secondary    NONE                         NONE                                capture in reject set(s)
On match, secondary entries are   reusable                     reusable                            consumed
# Outputs                         1                            1 out, (1 reject)                   1 out, (N rejects)
Captured in reject set(s)         Nothing (N/A)                unmatched primary entries           unmatched secondary entries

In this table, the comma separates primary and secondary input links (out and reject links).

This table contains everything one needs to know to use the three stages.

Page 280: Ascential DS 324PX - V70 - Student Reference

The Aggregator Stage

Purpose: Perform data aggregations

Specify:

Zero or more key columns that define the aggregation units (or groups)

Columns to be aggregated

Aggregation functions:
– count (nulls/non-nulls)
– sum
– max/min/range
– standard error
– % coeff. of variation
– sum of weights
– un/corrected sum of squares
– variance
– mean
– standard deviation

The grouping method (hash table or pre-sort) is a performance issue

Page 281: Ascential DS 324PX - V70 - Student Reference

Grouping Methods

Hash: results for each aggregation group are stored in a hash table, and the table is written out after all input has been processed
– doesn't require sorted data
– good when the number of unique groups is small: the running tally for each group's aggregate calculations needs to fit easily into memory (requires about 1 KB/group of RAM)
– Example: average family income by state requires about 0.05 MB of RAM

Sort: results for only a single aggregation group are kept in memory; when a new group is seen (the key value changes), the current group is written out
– requires input sorted by the grouping keys
– can handle unlimited numbers of groups
– Example: average daily balance by credit card

WARNING!
– Hash has nothing to do with the Hash partitioner! It just says a hash table of per-group tallies must be carried in RAM.
– Sort has nothing to do with the Sort stage. It just says the stage expects sorted input.
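The memory trade-off between the two methods can be sketched with toy mean aggregators (illustrative Python, not the stage's implementation):

```python
from itertools import groupby

def hash_aggregate(rows, key, value):
    """One running tally per group in a hash table. Input order is
    free, but every group's tally stays in memory until the end."""
    tally = {}
    for r in rows:
        k = key(r)
        s, n = tally.get(k, (0.0, 0))
        tally[k] = (s + value(r), n + 1)
    return {k: s / n for k, (s, n) in tally.items()}

def sort_aggregate(sorted_rows, key, value):
    """Only the current group is in memory; a group is emitted when the
    key changes, so the input must be key-sorted."""
    out = {}
    for k, grp in groupby(sorted_rows, key=key):
        vals = [value(r) for r in grp]
        out[k] = sum(vals) / len(vals)
    return out

rows = [("MA", 10.0), ("MA", 30.0), ("NY", 20.0)]   # sorted by state
print(hash_aggregate(rows, key=lambda r: r[0], value=lambda r: r[1]))
print(sort_aggregate(rows, key=lambda r: r[0], value=lambda r: r[1]))
# both print {'MA': 20.0, 'NY': 20.0}
```

The hash version holds a tally for every state at once (fine for ~50 groups); the sort version holds only one group at a time, which is what lets it handle unlimited groups.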

Page 282: Ascential DS 324PX - V70 - Student Reference

Aggregator Functions

Sum

Min, max

Mean

Missing value count

Non-missing value count

Percent coefficient of variation

Page 283: Ascential DS 324PX - V70 - Student Reference

Aggregator Properties

Page 284: Ascential DS 324PX - V70 - Student Reference

Aggregation Types

Aggregation types

Page 285: Ascential DS 324PX - V70 - Student Reference

Containers

Two varieties
– Local
– Shared

Local
– Simplifies a large, complex diagram

Shared
– Creates a reusable object that many jobs can include

Page 286: Ascential DS 324PX - V70 - Student Reference

Creating a Container

Create a job

Select (loop) portions to containerize

Edit > Construct container > local or shared

Page 287: Ascential DS 324PX - V70 - Student Reference

Using a Container

Select as though it were a stage

Page 288: Ascential DS 324PX - V70 - Student Reference

Exercise

Complete exercise 8-1

Page 289: Ascential DS 324PX - V70 - Student Reference

Module 9

Configuration Files

Page 290: Ascential DS 324PX - V70 - Student Reference

Objectives

Understand how DataStage PX uses configuration files to determine parallel behavior

Use this understanding to
– Build a PX configuration file for a computer system
– Change node configurations to support adding resources to processes that need them
– Create a job that will change resource allocations at the stage level

Page 291: Ascential DS 324PX - V70 - Student Reference

Configuration File Concepts

Determine the processing nodes and disk space connected to each node

When system changes, need only change the configuration file – no need to recompile jobs

When a DataStage job runs, the platform reads the configuration file
– The platform automatically scales the application to fit the system

Page 292: Ascential DS 324PX - V70 - Student Reference

Processing Nodes Are

Locations on which the framework runs applications

Logical rather than physical construct

Do not necessarily correspond to the number of CPUs in your system– Typically one node for two CPUs

Can define one processing node for multiple physical nodes or multiple processing nodes for one physical node

Page 293: Ascential DS 324PX - V70 - Student Reference

Optimizing Parallelism

Degree of parallelism determined by number of nodes defined

Parallelism should be optimized, not maximized
– Increasing parallelism distributes the workload but also increases Framework overhead

Hardware influences degree of parallelism possible

System hardware partially determines configuration

The hardware that makes up your system partially determines configuration. For example, applications with large memory requirements, such as sort operations, are best assigned to machines with a lot of memory. Applications that will access an RDBMS must run on its server nodes; operators using other proprietary software, such as SAS or SyncSort, must run on nodes with licenses for that software.

Page 294: Ascential DS 324PX - V70 - Student Reference

More Factors to Consider

Communication amongst operators
– Should be optimized by your configuration
– Operators exchanging large amounts of data should be assigned to nodes communicating by shared memory or a high-speed link

SMP – leave some processors for operating system

Desirable to equalize partitioning of data

Use an experimental approach– Start with small data sets– Try different parallelism while scaling up data set sizes

Page 295: Ascential DS 324PX - V70 - Student Reference

Factors Affecting Optimal Degree of Parallelism

CPU-intensive applications
– Benefit from the greatest possible parallelism

Disk-intensive applications
– Number of logical nodes should equal the number of disk spindles being accessed

Page 296: Ascential DS 324PX - V70 - Student Reference

PX Configuration File

Text file containing string data that is passed to the Framework
– Sits on the server side
– Can be displayed and edited

Name and location are found in the environment variable APT_CONFIG_FILE

Components
– Node
– Fast name
– Pools
– Resource

Page 297: Ascential DS 324PX - V70 - Student Reference

Sample Configuration File

{
  node "Node1"
  {
    fastname "BlackHole"
    pools "" "node1"
    resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools ""}
  }
}

For a single-node system, the node name is usually set to the value of the UNIX command uname -n.

Fastname attribute is the name of the node as it is referred to on the fastest network in the system, such as an IBM switch, FDDI, or BYNET.

The fast name is the physical node name that operators use to open connections for high-volume data transfers. Typically this is the principal node name as returned by the UNIX command uname -n.

Page 298: Ascential DS 324PX - V70 - Student Reference

Node Options

• Node name – name of a processing node used by PX
  – Typically the network name
  – Use the command uname -n to obtain the network name

• Fastname
  – Name of the node as referred to by the fastest network in the system
  – Operators use the physical node name to open connections
  – NOTE: for an SMP, all CPUs share a single connection to the network

• Pools
  – Names of pools to which this node is assigned
  – Used to logically group nodes
  – Can also be used to group resources

• Resource
  – Disk
  – Scratchdisk

Set of reserved node pool names:
DB2, Oracle, Informix, sas, sort, syncsort

Page 299: Ascential DS 324PX - V70 - Student Reference

Node Pools

node "node1" {
  fastname "server_name"
  pool "pool_name"
}

"pool_name" is the name of the node pool, e.g., "extra".

Node pools group processing nodes based on usage.
– Example: memory capacity and high-speed I/O.

One node can be assigned to multiple pools.

The default node pool ("") is made up of each node defined in the config file, unless the node is qualified as belonging to a different pool and is not designated as belonging to the default pool (see following example).
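For instance (an illustrative fragment, not taken from the course config files): because "" is not listed among its pools, the node below belongs only to the "extra" pool and is excluded from the default node pool:

```
node "n5" {
  fastname "s5"
  pool "extra"
  resource disk "/orch/n5/d1" {}
  resource scratchdisk "/temp" {}
}
```

Stages constrained to the default pool will therefore never run on this node.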

Page 300: Ascential DS 324PX - V70 - Student Reference

Resource Disk and Scratchdisk

node "node_0" {
  fastname "server_name"
  pool "pool_name"
  resource disk "path" {pool "pool_1"}
  resource scratchdisk "path" {pool "pool_1"}
  ...
}

The resource type can be disk(s) or scratchdisk(s).

"pool_1" is the disk or scratchdisk pool, allowing you to group disks and/or scratchdisks.

Page 301: Ascential DS 324PX - V70 - Student Reference

Disk Pools

Disk pools allocate storage

Pooling applies to both disk types

By default, PX uses the default pool, specified by “”

pool "bigdata"

Page 302: Ascential DS 324PX - V70 - Student Reference

Sorting Requirements

Resource pools can also be specified for sorting:

The Sort stage looks first for scratch disk resources in a “sort” pool, and then in the default disk pool

Sort uses as many scratch disks as defined in the first pool it finds

Recommendations:
– Each logical node defined in the configuration file that will run sorting operations should have its own sort disk.
– Each logical node's sorting disk should be a distinct disk drive or, if shared among nodes, a striped disk.
– In large sorting operations, each node that performs sorting should have multiple disks, where a sort disk is a scratch disk available for sorting that resides in either the sort or default disk pool.

Page 303: Ascential DS 324PX - V70 - Student Reference

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "sort"
    resource disk "/data/n1/d1" {}
    resource disk "/data/n1/d2" {}
    resource scratchdisk "/scratch" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/data/n2/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/data/n3/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/data/n4/d1" {}
    resource scratchdisk "/scratch" {}
  }
  ...
}

Configuration File: Example

Page 304: Ascential DS 324PX - V70 - Student Reference

Resource Types

Disk

Scratchdisk

DB2

Oracle

Saswork

Sortwork

Can exist in a pool– Groups resources together

Page 305: Ascential DS 324PX - V70 - Student Reference

Using Different Configurations

Lookup stage where DBMS is using a sparse lookup type

In this instance, since a sparse lookup is viewed as the bottleneck, the stage has been set to execute on multiple nodes.

Page 306: Ascential DS 324PX - V70 - Student Reference

Building a Configuration File

Scoping the hardware:
– Is the hardware configuration SMP, cluster, or MPP?
– Define each node structure (an SMP would be a single node):
  - Number of CPUs
  - CPU speed
  - Available memory
  - Available page/swap space
  - Connectivity (network/back-panel speed)
– Is the machine dedicated to PX? If not, what other applications are running on it?
– Get a breakdown of resource usage (vmstat, mpstat, iostat).
– Are there other configuration restrictions? E.g. the DB only runs on certain nodes and ETL cannot run on them.

Page 307: Ascential DS 324PX - V70 - Student Reference

To Create Hardware Specifications

Complete one per node:

Hardware Type _____________
Node Name _____________
# CPUs _____________
CPU Speed _____________
Memory _____________
Page Space _____________
Swap Space _____________
TCP Addr _____________
Switch Addr _____________

Complete one per disk subsystem:

Hardware Type _____________
Shared between nodes: Y or N
Storage Type _____________
Storage Size _____________
Read Cache size _____________
Write Cache size _____________
Read hit ratio _____________
Write hit ratio _____________
I/O rate (R/W) _____________
# channels/controllers _____________
Throughput _____________

In addition, record all coexistence usage (other applications or subsystems sharing this disk subsystem).

Page 308: Ascential DS 324PX - V70 - Student Reference

“Ballpark” Tuning

1. Calculate the theoretical I/O bandwidth available (don’t forget to reduce this by the amount other applications on this machine or others are impacting the I/O subsystem). See DS Performance and Tuning for calculation methods.

2. Determine the I/O bandwidth being achieved by the DS application (rows/sec * bytes/row).

3. If the I/O rate isn’t approximately equal to the theoretical, there is probably a bottleneck elsewhere (CPU, Memory, etc).

4. Attempt to tune to the I/O bandwidth.

5. Pay particular attention to I/O-intensive competing workloads such as database logging, paging/swapping, etc.

As a rule of thumb, generally expect the application to be I/O constrained. First try to parallelize (spread out) the I/O as much as possible. To validate that you’ve achieved this:

Some useful commands:
- iostat (I/O activity)
- vmstat (system activity)
- mpstat (processor utilization)
- sar (resource usage)

When running out of capacity:
- Tuning means sequentially removing bottlenecks.
- Look at throughput on the I/O subsystem: the amount of data moved times the number of I/Os gives bandwidth.
- Look at sar, iostat, or mpstat.

Page 309: Ascential DS 324PX - V70 - Student Reference

Exercise

Complete exercise 9-1 and 9-2

Page 310: Ascential DS 324PX - V70 - Student Reference

Blank

Page 311: Ascential DS 324PX - V70 - Student Reference

Module 10

Extending DataStage PX

Page 312: Ascential DS 324PX - V70 - Student Reference

Objectives

Understand the methods by which you can add functionality to PX

Use this understanding to:
– Build a DataStage PX stage that handles special processing needs not supplied with the vanilla stages
– Build a DataStage PX job that uses the new stage

Page 313: Ascential DS 324PX - V70 - Student Reference

PX Extensibility Overview

Sometimes it will be to your advantage to leverage PX’s extensibility. This extensibility includes:

Wrappers

Buildops

Custom Stages

Page 314: Ascential DS 324PX - V70 - Student Reference

When To Leverage PX Extensibility

Types of situations:
- Complex business logic, not easily accomplished using standard PX stages
- Reuse of existing C, C++, Java, COBOL, etc.

Page 315: Ascential DS 324PX - V70 - Student Reference

Wrappers vs. Buildop vs. Custom

Wrappers are good if you cannot or do not want to modify the application and performance is not critical.

Buildops: good if you need custom coding but do not need dynamic (runtime-based) input and output interfaces.

Custom (C++ coding using framework API): good if you need custom coding and need dynamic input and output interfaces.

Page 316: Ascential DS 324PX - V70 - Student Reference

Building “Wrapped” Stages

You can “wrapper” a legacy executable:
- Binary
- Unix command
- Shell script

… and turn it into a Parallel Extender stage capable, among other things, of parallel execution – as long as the legacy executable is:
- Amenable to data-partition parallelism (no dependencies between rows)
- Pipe-safe (can read rows sequentially; no random access to data)

Page 317: Ascential DS 324PX - V70 - Student Reference

Wrappers (Cont’d)

Wrappers are treated as a black box:
- PX has no knowledge of contents
- PX has no means of managing anything that occurs inside the wrapper
- PX only knows how to export data to and import data from the wrapper
- The user must know at design time the intended behavior of the wrapper and its schema interface

If the wrappered application needs to see all records prior to processing, it cannot run in parallel.

Page 318: Ascential DS 324PX - V70 - Student Reference

LS Example

Can this command be wrappered?

Page 319: Ascential DS 324PX - V70 - Student Reference

Creating a Wrapper

Used in this job ---

To create the “ls” stage

Example: wrapping the Unix ls command. Running ls /opt/AdvTrain/dnich would yield a list of files and subdirectories. The wrapper is thus comprised of the command and a parameter that contains a disk location.

Page 320: Ascential DS 324PX - V70 - Student Reference

Creating Wrapped Stages

From Manager: right-click on Stage Type > New Parallel Stage > Wrapped

We will “wrapper” an existing Unix executable – the ls command.

Wrapper Starting Point

Page 321: Ascential DS 324PX - V70 - Student Reference

Wrapper - General Page

Unix command to be wrapped

Name of stage

The Unix ls command can take several arguments, but we will use it in its simplest form: ls location, where location will be passed into the stage with a job parameter.

Page 322: Ascential DS 324PX - V70 - Student Reference

Conscientiously maintaining the Creator page for all your wrapped stages will eventually earn you the thanks of others.

The "Creator" Page

Page 323: Ascential DS 324PX - V70 - Student Reference

Wrapper – Properties Page

If your stage will have properties, complete the Properties page.

This will be the name of the property as it appears in your stage.

Page 324: Ascential DS 324PX - V70 - Student Reference

Wrapper - Wrapped Page

Interfaces – input and output columns. These should first be entered into the table definitions meta data (DS Manager); let’s do that now.

The Interfaces > input and output describe the meta data for how you will communicate with the wrapped application.

Page 325: Ascential DS 324PX - V70 - Student Reference

• Layout interfaces describe what columns the stage:
– Needs for its inputs (if any)
– Creates for its outputs (if any)
• Interfaces should be created as tables with columns in Manager

Interface schemas

Page 326: Ascential DS 324PX - V70 - Student Reference

Column Definition for Wrapper Interface

Page 327: Ascential DS 324PX - V70 - Student Reference

How Does the Wrapping Work?

– Define the schema for export and import

Schemas become interface schemas of the operator and allow for by-name column access

– Define multiple inputs/outputs required by UNIX executable

import

export

stdout or named pipe

stdin or named pipe

UNIX executable

output schema

input schema

QUIZ: Why does export precede import?

Answer: You must first EXIT the DS-PX environment to access the vanilla Unix environment. Then you must reenter the DS-PX environment.

Page 328: Ascential DS 324PX - V70 - Student Reference

Update the Wrapper Interfaces

This wrapper will have no input interface – i.e. no input link. The location will come as a job parameter that will be passed to the appropriate stage property.

Page 329: Ascential DS 324PX - V70 - Student Reference

Resulting Job

Wrapped stage

Page 330: Ascential DS 324PX - V70 - Student Reference

Job Run

Show file from Designer palette

Page 331: Ascential DS 324PX - V70 - Student Reference

Wrapper Story: Cobol Application

Hardware Environment:
– IBM SP2, 2 nodes with 4 CPUs per node

Software:
– DB2/EEE, COBOL, PX

Original COBOL Application:
– Extracted the source table, performed a lookup against a table in DB2, and loaded results to the target table
– 4 hours 20 minutes sequential execution

Parallel Extender Solution:
– Used PX to perform parallel DB2 extracts and loads
– Used PX to execute the COBOL application in parallel
– The PX Framework handled data transfer between DB2/EEE and the COBOL application
– 30 minutes 8-way parallel execution

Page 332: Ascential DS 324PX - V70 - Student Reference

Buildops

Buildop provides a simple means of extending beyond the functionality provided by PX, but does not use an existing executable (like the wrapper)

Reasons to use Buildop include:
- Speed / performance
- Complex business logic that cannot be easily represented using existing stages:
  – Lookups across a range of values
  – Surrogate key generation
  – Rolling aggregates
- Build once and reuse everywhere within the project; no shared container necessary
- Can combine functionality from different stages into one

Page 333: Ascential DS 324PX - V70 - Student Reference

BuildOps

– The user performs the fun tasks: encapsulate the business logic in a custom operator

– The Parallel Extender interface called “buildop” automatically performs the tedious, error-prone tasks: invoke needed header files, build the necessary “plumbing” for a correct and efficient parallel execution.

– Exploits extensibility of PX Framework

Page 334: Ascential DS 324PX - V70 - Student Reference

From Manager (or Designer), Repository pane:

Right-click on Stage Type > New Parallel Stage > {Custom | Build | Wrapped}

• "Build" stages from within Parallel Extender

• "Wrapping" existing Unix executables

BuildOp Process Overview

That is the fun one! There is programming!

This section just shows the flavor of what Buildop can do.

Page 335: Ascential DS 324PX - V70 - Student Reference

General Page

Identical to Wrappers, except: under the Build tab, your program!

Page 336: Ascential DS 324PX - V70 - Student Reference

Logic Tab for Business Logic

Enter Business C/C++ logic and arithmetic in four pages under the Logic tab

Main code section goes in the Per-Record page – it will be applied to all rows.

NOTE: Code will need to be ANSI C/C++ compliant. If code does not compile outside of PX, it won’t compile within PX either!

The four tabs are Definitions, Pre-Loop, Per-Record, Post-Loop.

The main action is in Per-Record.
- Definitions is used to declare and initialize variables.
- Pre-Loop has code to be performed prior to processing the first row.
- Post-Loop has code to be performed after processing the last row.
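The lifecycle of the four Logic-tab sections can be illustrated with a plain C++ sketch. This is not actual buildop code (there, the framework generates the loop and the column access); it only shows when each section runs, using a hypothetical stage that sums two input columns.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical record layout: two input columns and one output column.
struct Row { int a; int b; int sum; };

std::vector<Row> run_stage(std::vector<Row> rows) {
    // Definitions / Pre-Loop: declared and initialized once,
    // before the first row is processed.
    int rows_seen = 0;

    // Per-Record: applied to every row.
    for (Row &r : rows) {
        r.sum = r.a + r.b;
        ++rows_seen;
    }

    // Post-Loop: runs once, after the last row.
    std::printf("processed %d rows\n", rows_seen);
    return rows;
}
```

In a real buildop, the framework supplies this loop; you only write the bodies of the four sections.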

Page 337: Ascential DS 324PX - V70 - Student Reference

Code Sections under Logic Tab

Temporary variables declared [and initialized] here

Logic here is executed once BEFORE processing the FIRST row

Logic here is executed once AFTER processing the LAST row

Page 338: Ascential DS 324PX - V70 - Student Reference

I/O and Transfer

Under Interface tab: Input, Output & Transfer pages

Optional renaming of output port from default "out0"

Write row

Input page: 'Auto Read'
Read next row

In-Repository Table Definition

'False' setting, not to interfere with Transfer page

First line: output 0

This is the Output page. The Input page is the same, except it has an "Auto Read" column instead of "Auto Write".

The Input/Output interface TDs must be prepared in advance and put in the repository.

Page 339: Ascential DS 324PX - V70 - Student Reference

I/O and Transfer

• Transfer from input in0 to output out0.
• If the page is left blank or Auto Transfer = "False" (and RCP = "False"), only columns in the output Table Definition are written.

First line: Transfer of index 0

The role of Transfer will be made clearer soon with examples.

Page 340: Ascential DS 324PX - V70 - Student Reference

Building Stages: Simple Example

Example – sumNoTransfer:
– Adds input columns "a" and "b"; ignores other columns that might be present in input
– Produces a new "sum" column
– Does not transfer input columns

sumNoTransfer
  input:  a:int32; b:int32
  output: sum:int32

Only the column(s) explicitly listed in the output TD survive.

Page 341: Ascential DS 324PX - V70 - Student Reference

NO TRANSFER

• Causes:
– RCP set to "False" in the stage definition, and
– Transfer page left blank, or Auto Transfer = "False"

• Effects:
– Input columns "a" and "b" are not transferred
– Only the new column "sum" is transferred

Compare with transfer ON…

From Peek:

No Transfer

Page 342: Ascential DS 324PX - V70 - Student Reference

Transfer

TRANSFER

• Causes:
– RCP set to "True" in the stage definition, or
– Auto Transfer set to "True"

• Effects:
– New column "sum" is transferred, as well as
– Input columns "a" and "b", and
– Input column "ignored" (present in input, but not mentioned in the stage)

All the columns in the input link are transferred, irrespective of what the Input/Output TDs say.

Page 343: Ascential DS 324PX - V70 - Student Reference

Out Table;

Output

Adding a Column With Row ID

Hint for the quiz on the next slide (the solution is in the note).

Page 344: Ascential DS 324PX - V70 - Student Reference

Columns vs. Temporary C++ Variables

Columns:
– DS-PX type
– Defined in Table Definitions
– Value refreshed from row to row

Temporary C++ variables:
– C/C++ type
– Need declaration (in Definitions or Pre-Loop page)
– Value persistent throughout the "loop" over rows, unless modified in code

ANSWER to QUIZ

Replacing index = count++; with index++; would result in index=1 throughout. See the bottom bullet in the left column.
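The answer can be simulated in plain C++ (a sketch of the semantics, not buildop code): the column index is refreshed from the row on each iteration, while the temporary variable count, declared once, persists across the loop.

```cpp
#include <vector>

// Simulates per-record execution. 'index' plays the role of a column,
// so its value is re-initialized from the incoming row every iteration;
// 'count' plays the role of a temporary variable declared in the
// Definitions page, so it persists across the whole loop.
std::vector<int> row_ids(int n, bool use_counter) {
    std::vector<int> out;
    int count = 0;              // temporary variable: persists across rows
    for (int row = 0; row < n; ++row) {
        int index = 0;          // column: value refreshed for every row
        if (use_counter)
            index = count++;    // distinct, increasing IDs: 0, 1, 2, ...
        else
            index++;            // always yields 1: the per-row refresh wins
        out.push_back(index);
    }
    return out;
}
```

This is why the row-ID buildop must keep its counter in a temporary variable rather than incrementing the column itself.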

Page 345: Ascential DS 324PX - V70 - Student Reference

Exercise

Complete exercise 10-1 and 10-2

Page 346: Ascential DS 324PX - V70 - Student Reference

Exercise

Complete exercises 10-3 and 10-4

Page 347: Ascential DS 324PX - V70 - Student Reference

Custom Stage

Reasons for a custom stage:
– Add a PX operator not already in DataStage PX
– Build your own operator and add it to DataStage PX

Use PX API

Use Custom Stage to add new operator to PX canvas

Page 348: Ascential DS 324PX - V70 - Student Reference

Custom Stage

DataStage Manager > select Stage Types branch > right click

Page 349: Ascential DS 324PX - V70 - Student Reference

Custom Stage

Name of Orchestrate operator to be used

Number of input and output links allowed

Page 350: Ascential DS 324PX - V70 - Student Reference

Custom Stage – Properties Tab

Page 351: Ascential DS 324PX - V70 - Student Reference

The Result

Page 352: Ascential DS 324PX - V70 - Student Reference

Blank

Page 353: Ascential DS 324PX - V70 - Student Reference

Module 11

Meta Data in DataStage PX

Page 354: Ascential DS 324PX - V70 - Student Reference

Objectives

Understand how PX uses meta data, particularly schemas and runtime column propagation

Use this understanding to:
– Build schema definition files to be invoked in DataStage jobs
– Use RCP to manage meta data usage in PX jobs

Page 355: Ascential DS 324PX - V70 - Student Reference

Establishing Meta Data

Data definitions:
– Recordization and columnization
– Fields have properties that can be set at the individual field level
– Data types in the GUI are translated to types used by PX
– Described as properties on the Format/Columns tab (Outputs or Inputs pages), OR
– Using a schema file (can be full or partial)

Schemas:
– Can be imported into Manager
– Can be pointed to by some job stages (i.e. Sequential)

Page 356: Ascential DS 324PX - V70 - Student Reference

Data Formatting – Record Level

Format tab

Meta data described on a record basis

Record level properties

To view documentation on each of these properties, open a stage > input or output > format. Now hover your cursor over the property in question and help text will appear.

Page 357: Ascential DS 324PX - V70 - Student Reference

Data Formatting – Column Level

Defaults for all columns

Page 358: Ascential DS 324PX - V70 - Student Reference

Column Overrides

Edit row from within the columns tab

Set individual column properties

Page 359: Ascential DS 324PX - V70 - Student Reference

Extended Column Properties

Field and string

settings

Page 360: Ascential DS 324PX - V70 - Student Reference

Extended Properties – String Type

Note the ability to convert ASCII to EBCDIC

Page 361: Ascential DS 324PX - V70 - Student Reference

Editing Columns

Properties depend on the data type

Page 362: Ascential DS 324PX - V70 - Student Reference

Schema

Alternative way to specify column definitions for data used in PX jobs

Written in a plain text file

Can be written as a partial record definition

Can be imported into the DataStage repository

record {final_delim=end, delim=",", quote=double}
(
  first_name: string;
  last_name: string;
  gender: string;
  birth_date: date;
  income: decimal[9,2];
  state: string;
)

The format of each line describing a column is:

column_name:[nullability]datatype;

column_name. This is the name that identifies the column. Names must start with a letter or an underscore (_), and can contain only alphanumeric or underscore characters. The name is not case sensitive. The name can be of any length.
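The naming rule above can be expressed as a short check. This is an illustrative helper written for this course material, not part of DataStage.

```cpp
#include <cctype>
#include <string>

// Checks the schema column-name rule: must start with a letter or an
// underscore, and may contain only alphanumeric or underscore characters.
bool valid_column_name(const std::string &name) {
    if (name.empty())
        return false;
    if (!std::isalpha(static_cast<unsigned char>(name[0])) && name[0] != '_')
        return false;
    for (char c : name)
        if (!std::isalnum(static_cast<unsigned char>(c)) && c != '_')
            return false;
    return true;
}
```

For example, first_name and _x1 pass, while 1st and birth-date do not.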

nullability. You can optionally specify whether a column is allowed to contain a null value, or whether this would be viewed as invalid. If the column can be null, insert the word ’nullable’. By default columns are not nullable.

You can also include ’nullable’ at record level to specify that all columns are nullable, then override the setting for individual columns by specifying ‘not nullable’. For example:

record nullable (
  name: not nullable string[255];
  value1: int32;
  date: date;
)

datatype. This is the data type of the column.

Page 363: Ascential DS 324PX - V70 - Student Reference

Creating a Schema

Using a text editor:
– Follow correct syntax for definitions

OR

Import from an existing data set or file set:
– In DataStage Manager: Import > Table Definitions > Orchestrate Schema Definitions
– Select the checkbox for a file with a .fs or .ds extension

Page 364: Ascential DS 324PX - V70 - Student Reference

Importing a Schema

Schema location can be on the server or the local workstation.

Page 365: Ascential DS 324PX - V70 - Student Reference

Data Types

Date

Decimal

Floating point

Integer

String

Time

Timestamp

Vector

Subrecord

Raw

Tagged

Raw – collection of untyped bytes
Vector – elemental array
Subrecord – a record within a record (elements of a group level)
Tagged – column linked to another column that defines its data type

Page 366: Ascential DS 324PX - V70 - Student Reference

Partial Schemas

Only need to define column definitions that you are actually going to operate on

Allowed by stages with format tab– Sequential file stage– File set stage– External source stage– External target stage– Column import stage

Page 367: Ascential DS 324PX - V70 - Student Reference

Runtime Column Propagation

DataStage PX is flexible about meta data. It can cope with the situation where meta data isn’t fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP).
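The idea can be sketched in plain C++ (an illustration of RCP semantics, not DataStage internals): the stage logic only names the columns it cares about, and any extra columns present at runtime ride through unchanged.

```cpp
#include <map>
#include <string>

// A row modeled as a name->value map; column types are ignored here.
using Row = std::map<std::string, std::string>;

// Hypothetical stage: its design-time meta data defines only "name",
// but columns it does not know about are propagated to the output.
Row process(const Row &in) {
    Row out = in;                       // propagate all incoming columns
    out["name"] = in.at("name") + "!";  // operate only on the defined one
    return out;
}
```

A row carrying an undefined "extra" column would emerge with "extra" intact and "name" transformed, which is the behavior RCP provides between stages.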

RCP is always on at runtime.

Design and compile time column mapping enforcement:
– RCP is off by default.
– Enable first at the project level (Administrator: project properties).
– Enable at the job level (Job Properties: General tab).
– Enable at the stage level (link Output Column tab).

Page 368: Ascential DS 324PX - V70 - Student Reference

Enabling RCP at Project Level

Page 369: Ascential DS 324PX - V70 - Student Reference

Enabling RCP at Job Level

Page 370: Ascential DS 324PX - V70 - Student Reference

Enabling RCP at Stage Level

Go to output link’s columns tab

For the Transformer, you can find the output link’s Columns tab by first going to stage properties.

Page 371: Ascential DS 324PX - V70 - Student Reference

Using RCP with Sequential Stages

To utilize runtime column propagation in the sequential stage you must use the “use schema” option

Stages with this restriction:– Sequential– File Set– External Source– External Target

What is runtime column propagation? Runtime column propagation (RCP) allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can just define the columns you are interested in using in a job, but ask DataStage to propagate the other columns through the various stages. Such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between. Sequential files, unlike most other data sources, do not have inherent column definitions, and so DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on sequential files if you have used the Schema File property to specify a schema which describes all the columns in the sequential file. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that will require a schema file are:

Sequential File
File Set
External Source
External Target
Column Import
Column Export

Page 372: Ascential DS 324PX - V70 - Student Reference

Runtime Column Propagation

When RCP is Disabled:
– DataStage Designer will enforce Stage Input Column to Output Column mappings.
– At job compile time, Modify operators are inserted on output links in the generated osh.

Modify operators can add or change columns in a data flow.

Page 373: Ascential DS 324PX - V70 - Student Reference

Runtime Column Propagation

When RCP is Enabled:
– DataStage Designer will not enforce mapping rules.
– No Modify operator is inserted at compile time.
– Danger of runtime error if incoming column names do not match outgoing link column names – case sensitivity.

Page 374: Ascential DS 324PX - V70 - Student Reference

Exercise

Complete exercises 11-1 and 11-2

Page 375: Ascential DS 324PX - V70 - Student Reference

Module 12

Job Control Using the Job Sequencer

Page 376: Ascential DS 324PX - V70 - Student Reference

Objectives

Understand how the DataStage job sequencer works

Use this understanding to build a control job to run a sequence of DataStage jobs

Page 377: Ascential DS 324PX - V70 - Student Reference

Job Control Options

Manually write job control:
– Code generated in BASIC
– Use the Job Control tab on the Job Properties page
– Generates BASIC code which you can modify

Job Sequencer:
– Build a controlling job much the same way you build other jobs
– Comprised of stages and links
– No BASIC coding

Page 378: Ascential DS 324PX - V70 - Student Reference

Job Sequencer

Build like a regular job

Type “Job Sequence”

Has stages and links

Job Activity stage represents a DataStage job

Links represent passing control

Stages

Page 379: Ascential DS 324PX - V70 - Student Reference

Example

Job Activity stage – contains conditional triggers

Page 380: Ascential DS 324PX - V70 - Student Reference

Job Activity Properties

Job parameters to be passed

Job to be executed – select from dropdown

Page 381: Ascential DS 324PX - V70 - Student Reference

Job Activity Trigger

Trigger appears as a link in the diagram

Custom options let you define the code

Page 382: Ascential DS 324PX - V70 - Student Reference

Options

Use the custom option for conditionals:
– Execute if the job run was successful or had warnings only

Can add “wait for file” to execute

Add “execute command” stage to drop real tables and rename new tables to current tables

Page 383: Ascential DS 324PX - V70 - Student Reference

Job Activity With Multiple Links

Different links having different triggers

Page 384: Ascential DS 324PX - V70 - Student Reference

Sequencer Stage

Can be set to all or any

Build job sequencer to control job for the collections application

Page 385: Ascential DS 324PX - V70 - Student Reference

Notification

Notification Stage

Page 386: Ascential DS 324PX - V70 - Student Reference

Notification Activity

Page 387: Ascential DS 324PX - V70 - Student Reference

Sample DataStage log from Mail Notification

Page 388: Ascential DS 324PX - V70 - Student Reference

E-Mail Message

Notification Activity Message

Page 389: Ascential DS 324PX - V70 - Student Reference

Exercise

Complete exercise 12-1

Page 390: Ascential DS 324PX - V70 - Student Reference

Blank

Page 391: Ascential DS 324PX - V70 - Student Reference

Module 13

Testing and Debugging

Page 392: Ascential DS 324PX - V70 - Student Reference

Objectives

Understand spectrum of tools to perform testing and debugging

Use this understanding to troubleshoot a DataStage job

Page 393: Ascential DS 324PX - V70 - Student Reference

Environment Variables

Environment variables fall into broad categories, listed in the left pane. We'll see these categories one by one. All environment variable values listed in the ADMINISTRATOR are the project-wide defaults. They can be modified by DESIGNER per job, and again by DIRECTOR per run.

The default values are reasonable ones; there is no need for the beginning user to modify them, or even to know much about them – with one possible exception: APT_CONFIG_FILE (see the next slide).

Page 394: Ascential DS 324PX - V70 - Student Reference

Parallel Environment Variables

Highlighted: APT_CONFIG_FILE contains the path (on the server) of the active config file. The main aspect of a given configuration file is the number of nodes it declares. In the Labs, we used two files:
- One with one node declared, for use in sequential execution
- One with two nodes declared, for use in parallel execution

Page 395: Ascential DS 324PX - V70 - Student Reference

Environment VariablesStage Specific

The correct settings for these should be set at install. If you need to modify them, first check with your DBA.

Page 396: Ascential DS 324PX - V70 - Student Reference

Environment Variables

These are for the user to play with. Easy: they take only TRUE/FALSE values and control the verbosity of the log file. The defaults are set for minimal verbosity.
- APT_DUMP_SCORE, the top one, is an old favorite. It tracks datasets, nodes, partitions, and combinations – all TBD soon.
- APT_RECORD_COUNTS helps you detect load imbalance.
- APT_PRINT_SCHEMAS shows the textual representation of the unformatted metadata at all stages.

Online descriptions with the "Help" button.

Page 397: Ascential DS 324PX - V70 - Student Reference

Environment VariablesCompiler

You need to have these right to use the Transformer and the Custom stages; only these stages invoke the C++ compiler. The correct values are listed in the Release Notes.

Page 398: Ascential DS 324PX - V70 - Student Reference

Typical Job Log Messages:

Environment variables

Configuration File information
Framework Info/Warning/Error messages

Output from the Peek Stage

Additional info with "Reporting" environments

Tracing/Debug output:
– Must compile the job in trace mode
– Adds overhead

The Director

Page 399: Ascential DS 324PX - V70 - Student Reference

• Job Properties, from the Menu Bar of Designer
• Director will prompt you before each run

Job Level Environmental Variables

Project-wide environment variables set by ADMINISTRATOR can be modified on a per-job basis in DESIGNER's Job Properties, and on a per-run basis by DIRECTOR. This provides great flexibility.

Page 400: Ascential DS 324PX - V70 - Student Reference

Troubleshooting

If you get an error during compile, check the following:

Compilation problems:
– If a Transformer is used, check the C++ compiler and LD_LIBRARY_PATH
– If Buildop errors occur, try buildop from the command line
– Some stages may not support RCP – this can cause a column mismatch
– Use the Show Error and More buttons
– Examine the generated OSH
– Check environment variable settings

Very little integrity checking is done during compile; you should run Validate from the Director.

Highlights source of error

Page 401: Ascential DS 324PX - V70 - Student Reference

Generating Test Data

Row Generator stage can be used:
– Column definitions
– Data type dependent

Row Generator plus Lookup stages provide a good way to create robust test data from pattern files.

Page 402: Ascential DS 324PX - V70 - Student Reference

Blank