DataStage Training Day-1


Transcript of DataStage Training Day-1

Page 1: DataStage Training Day-1

DataStage Training Day 1

Page 2: DataStage Training Day-1

Course Prerequisites

To gain full benefit from this class, you should have a strong working knowledge of:
– Relational database management systems (RDBMS)
– Structured Query Language (SQL)
– Windows GUI
– Open Database Connectivity (ODBC)

Page 3: DataStage Training Day-1

What is DataStage?

DataStage is a comprehensive tool for the fast, easy creation and maintenance of data marts and data warehouses.

Page 4: DataStage Training Day-1

DataStage Architecture

[Architecture diagram: the DataStage server hosts the DataStage "Engine" and the Repository (built on UniVerse objects). The client tools (DataStage Administrator, Repository Manager, DataStage Designer, and DataStage Director) connect to the server, which in turn connects to data sources via ODBC, native APIs, FTP, etc.]

Page 5: DataStage Training Day-1

Client Components

• Designer: creates DataStage jobs that are compiled into executable programs.

• Director: used to run and monitor DataStage jobs.

• Manager: a user interface to view and edit the contents of the repository and to import and export DataStage components.

• Administrator: a user interface for administrative tasks such as setting up users and adding projects.

Page 6: DataStage Training Day-1


DataStage Designer

Page 7: DataStage Training Day-1


Manager and Director

Page 8: DataStage Training Day-1

Repository

• Proprietary file format.

• A repository is created for each project.

• A single server can handle multiple repositories.

• Stores all source/target definitions.

• Can be imported/exported in an XML format.

• Repository contents:
– Data Elements
– Jobs
– Routines
– Shared Containers
– Stage Types
– Table Definitions
– Transforms

Page 9: DataStage Training Day-1

DataStage Administrator

• Creating and attaching a project to DataStage
• Executing a UniVerse command (see the example below)
• Cleaning up a project
• License permissions
• Auto-purge of log files
• Inactivity timeout
• User for scheduled jobs
• User name troubleshooting
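As a hedged illustration (the command comes from common DataStage administration practice, not from this transcript), a UniVerse command issued from the Administrator might be:

    CLEAR.FILE &PH&

This clears the &PH& file, which accumulates log output from phantom (background) processes in the project.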

Page 10: DataStage Training Day-1

DataStage Designer

[Screenshot: the Designer window, showing the stage palette (Stages), the design area, and the repository pane.]

Page 11: DataStage Training Day-1

DataStage Server Stages

• A stage is like a station which performs a certain task, e.g. read data, aggregate data.

• Each stage has its own editable properties.

The stages used most often are:
• ODBC Stage
• Sequential File Stage
• Transformation Stage
• Aggregator Stage

Page 12: DataStage Training Day-1

ODBC Stage

• The ODBC stage can be used to extract, write, and aggregate data.

• An ODBC stage can have any number of inputs and outputs.

• The three editable property pages for this stage are "Stage", "Inputs", and "Outputs".

Input: when data is added/loaded to a table.

Output: when data is read/extracted from a table.

Note: the DSN has to be defined on the server machine as a system DSN.

Page 13: DataStage Training Day-1

ODBC Stage - Stage Properties

• Stage name: identifies that particular stage.

• You can pick the DSN name from the drop-down list.

• Specify a valid username and password for that DSN. (These can be supplied as job parameters entered at run time.)

• Specify the database-specific quote character.

• You can choose the character map for ODBC support (necessary for supporting various languages).

Page 14: DataStage Training Day-1

ODBC Stage - Output Properties (Extract)

Define the SQL type:
• Generated query: built depending upon the columns chosen.
• User-generated query (SQL override).
• Stored procedure read.

Add multiple tables which have been imported before.

Preview data.
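As a hedged sketch (the table and column names are invented for illustration), a generated query built from two selected columns of one table might read:

    SELECT PRODUCT_ID, PRODUCT_NAME
    FROM PRODUCTS

An SQL override replaces this with any user-written SELECT that returns the defined columns.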

Page 15: DataStage Training Day-1

ODBC Stage - Output Properties (contd.)

Specify joins if multiple tables are used, or any clause to restrict the data being extracted.

Verify the SQL being generated to extract data; it can be user-defined or generated.

Isolation Level

Specify a suitable transaction isolation level. This specifies how potential conflicts between transactions (i.e., dirty reads, nonrepeatable reads, and phantom reads) are handled. Choose from:

• None. The default isolation level for the database is used.

• Read Uncommitted. Dirty reads, nonrepeatable reads, and phantom reads are all possible.

• Read Committed. Dirty reads are not possible. Nonrepeatable reads and phantom reads are both possible.

• Repeatable Read. Dirty reads and nonrepeatable reads are not possible, but phantom reads are possible.

• Serializable. Transactions are serializable. Dirty reads, nonrepeatable reads, and phantom reads are not possible.

• Versioning. Transactions are serializable, but have a higher concurrency than with Serializable. Versioning is implemented using a non-locking protocol.

• Auto commit. Each write is committed separately. Do not use this in conjunction with transaction groups.

Page 16: DataStage Training Day-1

ODBC Stage - Extracting Data

• Add table definitions before you can incorporate them in an ODBC stage.

• You can extract from multiple tables in the same ODBC stage by specifying joins in the selection clause.

• You can use user-defined SQL or stored procedures to extract data.

• The columns loaded can be from multiple tables, but the column names have to be unique (see the example below).

• You can preview data.

• All the columns loaded in the stage are owned by the link which connects it to the next stage.
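A hedged example of a user-defined extraction query joining two tables (the table and column names are invented); note that every output column name is unique:

    SELECT O.ORDER_ID, O.CATEGORY_ID, C.CATEGORY_NAME
    FROM ORDERS O, CATEGORIES C
    WHERE O.CATEGORY_ID = C.CAT_ID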

Page 17: DataStage Training Day-1

ODBC Stage - Input Properties (Load)

• The input "Stage" properties remain the same as the output properties.

• Choose the table name if the table definition has been imported before, or create the table in the target ODBC database.

• Choose the appropriate update action.

Update action specifies how the data is written. Choose the option you want from the drop-down list box:

• Clear the table, then insert rows. Deletes the contents of the table and adds the new rows.
• Insert rows without clearing. Inserts the new rows in the table.
• Replace existing rows completely. Deletes the existing rows, then adds the new rows to the table.
• Update existing rows only. Updates the existing data rows. If a row with the supplied key does not exist in the table, the table is not updated and a warning is logged.
• Insert new or update existing rows. New rows are added or, if the insert fails, the existing rows are updated.
• Update existing or insert new rows. The existing data rows are updated or, if this fails, new rows are added.
• Call stored procedure. The data is written using a stored procedure. When you select this option, the Procedure name field appears.
• User-defined SQL. The data is written using a user-defined SQL statement. When you select this option, the View SQL tab is replaced by the Enter SQL tab.
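As a hedged sketch of the kind of parameterized SQL shown on the View SQL tab (the table, columns, and key are invented), "Insert rows without clearing" and "Update existing rows only" correspond roughly to:

    INSERT INTO CUSTOMERS (CUSTOMER_ID, NAME, CITY)
    VALUES (?, ?, ?)

    UPDATE CUSTOMERS
    SET NAME = ?, CITY = ?
    WHERE CUSTOMER_ID = ?

with one parameter marker per column and the key column(s) driving the WHERE clause.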

Page 18: DataStage Training Day-1

ODBC Stage - Input Properties: Columns

• Columns can be defined by adding a new row in the columns grid and editing the row contents.

• You can specify data types, keys, etc.

• If the target table has already been defined and imported, you can load the table definition as you did before (right-click in the grid to do so).

• Column definitions can also be taken from the link coming into this stage.

Page 19: DataStage Training Day-1

ODBC Stage - Create Table Option

• You can define the DDL for table creation, or it can be generated by the tool.

• You can choose to drop the table before creating it if necessary.
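A hedged example of the DDL involved (the table and columns are invented):

    DROP TABLE CUSTOMER_DIM;

    CREATE TABLE CUSTOMER_DIM (
        CUSTOMER_ID  INTEGER      NOT NULL,
        NAME         VARCHAR(40),
        CITY         VARCHAR(30),
        PRIMARY KEY (CUSTOMER_ID)
    );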

Page 20: DataStage Training Day-1

ODBC Stage - Transaction Handling

• Isolation levels are chosen as discussed before.

• Rows per transaction: the number of rows written before the data is committed to the table. The default value is 0, signifying that all the rows are written before the data is committed.

• Parameter array size: the number of rows written at a time. The default is 1, signifying that each row is written in a separate operation. This should be changed/tuned depending upon the storage space required.

Note: if the above two settings conflict, Rows per transaction takes precedence.
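A worked example (the values are invented): with Rows per transaction = 100 and Parameter array size = 10, the stage sends 10 rows per write operation and commits after every 100 rows, i.e. after every 10 write operations.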

Page 21: DataStage Training Day-1

ODBC Stage - Recap

• Define the connection.

• Define/import the tables and the table metadata.

• Choose one of the following to read or write data:
– User-defined SQL
– Generated SQL
– Stored procedure

Page 22: DataStage Training Day-1

Sequential File Stage

• Used to extract/read data from, or write data to, a flat file.

• The text file can be read from or created as any file on the server or on any drive mapped on the server.

• Can have any number of inputs and outputs.

Page 23: DataStage Training Day-1

Sequential File Stage - Defining Input Data

When you write data to a Sequential File stage, it has an input link. The properties are defined on the Inputs page of the stage:

• File name.

• Update action.

• Backup existing file: creates a backup before overwriting or appending to the file. The backup can be used if a job is reset during execution.

Note: the file definition has to be imported into the repository before it is used in this stage. The procedure is similar to importing ODBC table definitions.

Page 24: DataStage Training Day-1

Sequential File Stage - Input Data Format

The Format tab contains parameters that define the format of data in the file:

• Fixed-width columns: if this is chosen, the data is written in fixed-width columns, with each width calculated from the "Display" value in the column definition.

• First line is column names.

• Omit last new-line: omits the last new-line character in the file.

• Flush after every row: used for named pipes only; records between reader and writer are handled one record at a time.

• Delimiter: specified if fixed width is not chosen.

• Quote character.

• Default padding.

• Default NULL string.
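A hedged sample of a comma-delimited file matching these settings (the data is invented), with "First line is column names" set, delimiter "," and quote character '"':

    PRODUCT_ID,PRODUCT_NAME,PRICE
    101,"Widget, small",9.99
    102,Gadget,14.50

The quote characters protect the embedded delimiter in "Widget, small".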

Page 25: DataStage Training Day-1

Sequential File Stage - Output Data Format

The General page and the Columns page remain the same for both reading and writing of data.

Additional properties:

Missing column action: allows you to specify the action for columns missing from the data being read. The options are:

• Pad with NULL

• Pad with empty string

• Map empty string

Suppress truncation warnings: used if you are reading only, say, three columns from a file which has more than three columns defined.

Page 26: DataStage Training Day-1

Aggregator Stage

The Aggregator stage classifies data rows from a single input link into groups and computes totals or other aggregate functions; the sums/totals are output from the stage. You can have only one input link, but many output links if the data has to be grouped in different ways.

The Aggregator is an "active" stage.

Before/after routines: routines which have been included in the repository and can be executed before or after the stage, e.g. to run a DOS command or send email.

Page 27: DataStage Training Day-1

Aggregator Stage - Input Data

Data to be aggregated is passed along a link from a previous stage, and that link is the input to the Aggregator stage. The columns coming out of the previous stage are reflected on the Input Columns tab.

All the column headers except Sort and Sort Order have been discussed previously.

Sort: the Aggregator stage performs the grouping of rows in memory. If the input data coming from an ODBC or Sequential File stage is already sorted (using an ORDER BY clause), there are performance efficiencies to be gained. This column defines the sort column position.

E.g. with ORDER BY PRODUCT_ID, DATE_ID the sort key position for the PRODUCT_ID column is 1 and for the DATE_ID column is 2 (see the query below).

Sort Order: Ascending (default) or Descending.
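A hedged example of such a presorted extraction feeding the Aggregator (the table name is invented):

    SELECT PRODUCT_ID, DATE_ID, SALES_AMOUNT
    FROM SALES_FACT
    ORDER BY PRODUCT_ID, DATE_ID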

Page 28: DataStage Training Day-1

Aggregator Stage - Output Data

The standard column properties appear, with two changes:

Group: specifies whether a given column is used in the group-by clause.

Derivation: each output column needs a derivation in this stage. This is discussed on the next slide.

Note: each column in the input must either be grouped or summarized.

[Screenshot: the Derivation editor.]
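In SQL terms (a hedged analogy; the table and column names are invented), a stage that groups by PRODUCT_ID and sums SALES_AMOUNT behaves like:

    SELECT PRODUCT_ID, SUM(SALES_AMOUNT)
    FROM SALES_FACT
    GROUP BY PRODUCT_ID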

Page 29: DataStage Training Day-1

Aggregator Stage - Derivation Editor

• Click the Group By box if you want to group by a column.

• Iteratively choose each column and define whether to group by it or to apply an aggregation function from the drop-down list.

Note: the functions available change according to the underlying column data type.

Page 30: DataStage Training Day-1

Transformation Stage

• Takes extracted data, performs conversions, and passes the data on to another Transformation stage or to any other DataStage stage.

• Can have any number of inputs and outputs.

• Has one primary data link and many reference links.

Double-click the stage to open the Transformer Editor.

[Screenshot: the Transformer Editor, showing the link area and the metadata area.]

Page 31: DataStage Training Day-1

Transformer Editor

• Link area: where all the column definitions, expressions, stage variables, and constraints are defined. The link area is divided into two panes: the left pane shows the input links, and the right pane shows the output columns together with their derivations, constraints, and stage variables. Each of these will be explained in detail.

• Metadata area: shows the column metadata, again split into two panes, one for the input columns and one for the output columns. Here you can edit the column properties as you see necessary.

• Shortcut Area:

Page 32: DataStage Training Day-1

Transformer Concepts

When you use the Transformation stage you have already defined what input data is to be passed and changed. The input columns are transformed in the Transformer Editor, in the output pane, for each column chosen.

Input links: this stage has one primary link, which carries the bulk of the data to be transformed. It can also use many other links as reference links, which affect how the data is transformed (a look-up activity).

Reference links: you can define key expressions for the key columns on the reference links. The most common expression specifies an equi-join between the primary table and the reference table.

[Diagram: a primary link and a reference link entering the Transformer; a key expression links the two tables using a join.]

Page 33: DataStage Training Day-1

Transformation Stage - Key Expressions

Key expressions affect how the data flowing through the stage is matched against reference data. They are defined for the "key" columns of reference links. A column is made a key column by changing the Key field to Yes in the reference link's metadata area.

In the example, CAT_ID is changed to a key column, and its key expression is then defined in the link area:

• A join is specified between the primary link column CATEGORY_ID and the reference link key column CAT_ID.

• The key expressions are validated by the expression editor. If they are incorrect, they appear in red, denoting an error.
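As a hedged sketch (the link name PrimaryIn is invented; the column names come from the slide), the key expression entered against CAT_ID simply references the primary link column, written as LinkName.ColumnName:

    PrimaryIn.CATEGORY_ID

Each primary row is then looked up against the reference rows whose CAT_ID equals that value.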

Page 34: DataStage Training Day-1

Transformation Stage - Output Column Expressions

What can you do with the Transformation stage?

Add/delete columns: you can add new columns by right-clicking in the output pane of the link area and then defining the column metadata in the metadata area. The new column derivations can be custom strings, derivations, concatenations, or standard functions applied to input columns or stage variables. We will discuss them shortly.

Move/edit columns: in the output pane of the link area you can move the columns up and down by dragging. In the metadata area you can edit the column definitions as per requirements.

Before/after routines: this is an active stage, hence you can specify routines to be executed, passing suitable input values as discussed before.

Page 35: DataStage Training Day-1

Transformation Stage - Expression Editor

The expression editor allows you to define expressions for columns which have to be modified in the Transformer stage. What can you do with the expression editor? Where can you open and use it?

Entering expressions: whenever the insertion point is in an expression box, you can right-click to open the Suggest Operand menu. This gives rich support for defining the various elements, and for using predefined transforms, routines, and functions. It also allows you to define expressions based on stage variables.

Continued…

Page 36: DataStage Training Day-1

Transformation Stage - Expression Editor

The Suggest Operand menu for the Transformer stage offers:

DS macros: provided with DataStage and requiring no arguments; you can select them from a drop-down list.

DS functions: provides a list of available DataStage functions which can be used.

DS constants: a list of predefined DataStage constants.

Page 37: DataStage Training Day-1

Transformation Stage - Expression Editor

The Suggest Operand menu (contd.):

DS routines: defined in the repository; you can use your own routines or the ones provided with DataStage.

DS transforms: provides a list of available DataStage transforms. Replace the %Arg% placeholders with the real arguments required by that particular transform, in the required format. The transforms can be viewed in the Manager.

Please refer to the manual for a detailed description of each transform and routine.
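Two hedged examples of derivations using built-in functions (the link and column names are invented; Trim and Oconv are standard DataStage BASIC functions):

    Trim(OrdersIn.CUSTOMER_NAME)

    Oconv(OrdersIn.ORDER_DATE, "D-YMD[4,2,2]")

The first strips surrounding spaces; the second formats an internal date as YYYY-MM-DD.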

Page 38: DataStage Training Day-1

Transformation Variables

You can define local variables for a given Transformation stage. These can be used to define key expressions or the output column derivations; they can also be used as in the sketch below.

Stage variables are shown in the output pane of the Transformation stage.
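A hedged sketch of a common stage-variable pattern (the names are invented): svPrevKey remembers the key of the previous row so that the first row of each group can be flagged. Stage variables are evaluated top to bottom for every input row:

    svIsNewKey: If OrdersIn.KEY_ID = svPrevKey Then @FALSE Else @TRUE
    svPrevKey:  OrdersIn.KEY_ID

An output column derivation or a constraint can then test svIsNewKey.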

Page 39: DataStage Training Day-1

Transformation Constraints

You can limit the output data by specifying constraints in the Transformation stage. Constraints are BASIC expressions which can be defined for each output link of a Transformer stage. You can also specify a reject link for rows which haven't been written to any output link. To define a constraint, open the constraints dialog box; a similar dialog box appears for each output link.

Specify the constraint expression as desired.

This can be used while writing data to tables partitioned on certain keys (e.g. KEY_ID > 10) or for any other filtering criterion desired.
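A hedged example of a constraint expression on an output link (the link and column names are invented; And, Not() and IsNull() are standard in DataStage BASIC expressions):

    OrdersIn.KEY_ID > 10 And Not(IsNull(OrdersIn.KEY_ID))

Rows that satisfy no output link's constraint can then be caught on the reject link.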

Page 40: DataStage Training Day-1

Folder Stage

• Folder stages are used to read/write data as files in a directory located on the DataStage server.

• A Folder stage can read multiple files from a single directory and can deliver the files to the job as rows on an output link.

Plug-in Stage

• Plug-in stages are mostly used to access external programs that can be leveraged by DataStage applications.

• DataStage supports a number of plug-in stages to perform special tasks. They have to be installed separately on the server and configured before use.

• After installing these stages, they have to be registered on the server system before being used.

E.g. ORABULK: bulk loading of data into an Oracle database.

BCPLOAD: bulk loading for MS SQL Server.

Each plug-in stage has its own configurable properties, which can be checked in the manuals. Please refer to them if you plan to use these stages.