
    All Datastage Stages

    Datastage parallel stages groups

    DataStage and QualityStage stages are grouped into the following logical sections:

    General objects

    Data Quality Stages

    Database connectors

    Development and Debug stages

    File stages

    Processing stages

Real Time stages

    Restructure Stages

Sequence activities

Please refer to the list below for a description of the stages used in DataStage and QualityStage. We classified all stages in order of importance and frequency of use in real-life deployments (and also on certification exams). Also, the most widely used stages are marked bold or have a subpage available with a detailed description and examples.

    DataStage and QualityStage parallel stages and activities


    General elements

Link - indicates a flow of the data. There are three main types of links in Datastage: stream, reference and lookup.

Container (can be private or shared) - the main outcome of having containers is to visually simplify a complex datastage job design and keep the design easy to understand.

Annotation is used for adding floating datastage job notes and descriptions on a job canvas. Annotations provide a great way to document the ETL process and help understand what a given job does.

Description Annotation shows the contents of a job description field. One description annotation is allowed in a datastage job.

Debug and development stages

Row generator produces a set of test data which fits the specified metadata (can be random or cycled through a specified list of values). Useful for testing and development.

Column generator adds one or more columns to the incoming flow and generates test data for these columns.

Peek stage prints record column values to the job log, which can be viewed in Director. It can have a single input link and multiple output links.


Sample stage samples an input data set. Operates in two modes: percent mode and period mode.

Head selects the first N rows from each partition of an input data set and copies them to an output data set.

Tail is similar to the Head stage. It selects the last N rows from each partition.

Write Range Map writes a data set in a form usable by the range partitioning method.

Processing stages

Aggregator joins data vertically by grouping the incoming data stream and calculating summaries (sum, count, min, max, variance, etc.) for each group. The data can be grouped using two methods: hash table or pre-sort.


Copy - copies input data (a single stream) to one or more output data flows.

FTP stage uses the FTP protocol to transfer data to a remote machine.

Filter filters out records that do not meet specified requirements.

Funnel combines multiple streams into one.

Join combines two or more inputs according to the values of a key column(s). Similar concept to a relational DBMS SQL join (ability to perform inner, left, right and full outer joins). It can have 1 left and multiple right inputs (all need to be sorted) and produces a single output stream (no reject link).

Lookup combines two or more inputs according to the values of a key column(s). The Lookup stage can have 1 source and multiple lookup tables. Records don't need to be sorted; it produces a single output stream and a reject link.

Merge combines one master input with multiple update inputs according to the values of a key column(s). All inputs need to be sorted and unmatched secondary entries can be captured in multiple reject links.

Modify stage alters the record schema of its input dataset. Useful for renaming columns, non-default data type conversions and null handling.

Remove duplicates stage needs a single sorted data set as input. It removes all duplicate records according to a specification and writes to a single output.

Slowly Changing Dimension automates the process of updating dimension tables where the data changes over time. It supports SCD type 1 and SCD type 2.

Sort sorts input columns.

Transformer stage handles extracted data, performs data validation, conversions and lookups.

Change Capture - captures the before and after state of two input data sets and outputs a single data set whose records represent the changes made.

Change Apply - applies the change operations to a before data set to compute an after data set. It gets its data from a Change Capture stage.

Difference stage performs a record-by-record comparison of two input data sets and outputs a single data set whose records represent the difference between them. Similar to the Change Capture stage.


Checksum - generates a checksum from the specified columns in a row and adds it to the stream. Used to determine if there are differences between records.

Compare performs a column-by-column comparison of records in two presorted input data sets. It can have two input links and one output link.

Encode encodes data with an encoding command, such as gzip.

Decode decodes a data set previously encoded with the Encode stage.

External Filter permits specifying an operating system command that acts as a filter on the processed data.

Generic stage allows users to call an OSH operator from within a DataStage stage with options as required.

Pivot Enterprise is used for horizontal pivoting. It maps multiple columns in an input row to a single column in multiple output rows. Pivoting data results in a dataset with fewer columns but more rows.

Surrogate Key Generator generates a surrogate key for a column and manages the key source.

Switch stage assigns each input row to an output link based on the value of a selector field. Provides a similar concept to the switch statement in most programming languages.

Compress - packs a data set using a GZIP utility (or the compress command on LINUX).


File stages

Sequential file is used to read data from or write data to one or more flat (sequential) files.

Data Set stage allows users to read data from or write data to a dataset. Datasets are operating system files, each of which has a control file (.ds extension by default) and one or more data files (unreadable by other applications).

File Set stage allows users to read data from or write data to a fileset. Filesets are operating system files, each of which has a control file (.fs extension) and data files. Unlike datasets, filesets preserve formatting and are readable by other applications.

Complex Flat File allows reading from complex file structures on a mainframe machine, such as MVS data sets, header and trailer structured files, files that contain multiple record types, QSAM and VSAM files.

External Source - permits reading data that is output from multiple source programs.

External Target - permits writing data to one or more programs.

Lookup File Set is similar to the File Set stage. It is a partitioned hashed file which can be used for lookups.

Database stages

Oracle Enterprise allows reading data from and writing data to an Oracle database (database versions from 9.x to 10g are supported).


ODBC Enterprise permits reading data from and writing data to a database defined as an ODBC source. In most cases it is used for processing data from or to Microsoft Access databases and Microsoft Excel spreadsheets.

DB2/UDB Enterprise permits reading data from and writing data to a DB2 database.

Teradata permits reading data from and writing data to a Teradata data warehouse. Three Teradata stages are available: Teradata connector, Teradata Enterprise and Teradata Multiload.

SQL Server Enterprise permits reading data from and writing data to Microsoft SQL Server 2005 and 2008 databases.

Sybase permits reading data from and writing data to Sybase databases.

Stored procedure stage supports Oracle, DB2, Sybase, Teradata and Microsoft SQL Server. The Stored Procedure stage can be used as a source (returns a rowset), as a target (passes a row to a stored procedure to write) or a transform (to invoke procedure processing within the database).

MS OLEDB helps retrieve information from any type of information repository, such as a relational source, an ISAM file, a personal database, or a spreadsheet.

Dynamic Relational Stage (Dynamic DBMS, DRS stage) is used for reading from or writing to a number of different supported relational DB engines using native interfaces, such as Oracle, Microsoft SQL Server, DB2, Informix and Sybase.

Informix (CLI or Load)

DB2 UDB (API or Load)

Classic Federation

RedBrick Load

Netezza Enterprise

iWay Enterprise


Real Time stages

XML Input stage makes it possible to transform hierarchical XML data into flat relational data sets.

XML Output writes tabular data (relational tables, sequential files or any datastage data streams) to XML structures.

XML Transformer converts XML documents using an XSL stylesheet.

WebSphere MQ stages provide a collection of connectivity options to access IBM WebSphere MQ enterprise messaging systems. There are two MQ stage types available in DataStage and QualityStage: WebSphere MQ connector and WebSphere MQ plug-in stage.

Web services client

Web services transformer

Java client stage can be used as a source stage, as a target and as a lookup. The java package consists of three public classes: com.ascentialsoftware.jds.Column, com.ascentialsoftware.jds.Row, com.ascentialsoftware.jds.Stage.

Java transformer stage supports three links: input, output and reject.

WISD Input - Information Services Input stage

WISD Output - Information Services Output stage


    Restructure stages

Column export stage exports data from a number of columns of different data types into a single column of data type ustring, string, or binary. It can have one input link, one output link and a rejects link.

Column import is complementary to the Column Export stage. Typically used to divide data arriving in a single column into multiple columns.

Combine records stage combines rows which have identical keys into vectors of subrecords.

Make subrecord combines specified input vectors into a vector of subrecords whose columns have the same names and data types as the original vectors.

Make vector joins specified input columns into a vector of columns.

Promote subrecord - promotes input subrecord columns to top-level columns.

Split subrecord - separates an input subrecord field into a set of top-level vector columns.

Split vector promotes the elements of a fixed-length vector to a set of top-level columns.


Data Quality QualityStage stages

Investigate stage analyzes the data content of specified columns of each record from the source file. Provides character and word investigation methods.

Match frequency stage takes input from a file, database or processing stages and generates a frequency distribution report.

MNS - multinational address standardization.

QualityStage Legacy

Reference Match

Standardize

Survive

Unduplicate Match

WAVES - worldwide address verification and enhancement system.


Sequence activity stage types

Job Activity specifies a Datastage server or parallel job to execute.

Notification Activity - used for sending emails to user-defined recipients from within Datastage.

Sequencer is used for synchronization of the control flow of multiple activities in a job sequence.

Terminator Activity permits shutting down the whole sequence once a certain situation occurs.

Wait For File Activity - waits for a specific file to appear or disappear and launches the processing.

EndLoop Activity

Exception Handler

Execute Command

Nested Condition

Routine Activity

StartLoop Activity

UserVariables Activity


Configuration file:

The Datastage configuration file is a master control file (a text file which sits on the server side) for jobs which describes the parallel system resources and architecture. The configuration file provides the hardware configuration for supporting such architectures as SMP (single machine with multiple CPUs, shared memory and disk), Grid, Cluster or MPP (multiple CPUs, multiple nodes and dedicated memory per node). DataStage understands the architecture of the system through this file.

This is one of the biggest strengths of Datastage. For cases in which you have changed your processing configuration, or changed servers or platform, you will never have to worry about it affecting your jobs, since all the jobs depend on this configuration file for execution. Datastage jobs determine which node to run the process on, where to store the temporary data and where to store the dataset data, based on the entries provided in the configuration file. There is a default configuration file available whenever the server is installed.

The configuration files have the extension ".apt". The main outcome of having the configuration file is to separate the software and hardware configuration from the job design. It allows changing hardware and software resources without changing a job design. Datastage jobs can point to different configuration files by using job parameters, which means that a job can utilize different hardware architectures without being recompiled.

The configuration file contains the different processing nodes and also specifies the disk space provided for each processing node; these are logical processing nodes that are specified in the configuration file. So if you have more than one CPU this does not mean the nodes in your configuration file correspond to these CPUs. It is possible to have more than one logical node on a single physical node. However, you should be wise in configuring the number of logical nodes on a single physical node. Increasing nodes increases the degree of parallelism, but it does not necessarily mean better performance because it results in a larger number of processes. If your underlying system does not have the capability to handle these loads, you will end up with a very inefficient configuration on your hands.

1. APT_CONFIG_FILE is the environment variable using which DataStage determines the configuration file to be used (one can have many configuration files for a project). In fact, this is what is generally used in production. However, if this environment variable is not defined, how does DataStage determine which file to use?

If the APT_CONFIG_FILE environment variable is not defined, then DataStage looks for the default configuration file (config.apt) in the following path:
1. Current working directory.
2. INSTALLDIR


    7" Define 2ode in configuration file( 2ode is a logical processing unit" +ach node in a configuration file is distinguished by a virtualname and defines a number and speed of /P.s) memory availability) page and swap space)networ* connectivity details) etc"

    3. hat are the different options a logical node can have in the configuration file?5" #astname 2he fastname is the physical node name that stages use to open connections for high

    volume data transfers" he attribute of this option is often the networ* name" ypically) you canget this name by using .ni& command Huname $nI"

    7" pools 22ame of the pools to which the node is assigned to" 4ased on the characteristics of theprocessing nodes you can group nodes into set of pools"

    5" ( pool can be associated with many nodes and a node can be part of many pools"7" ( node belongs to the default pool unless you e&plicitly specify apools list for it) and omit the

    default pool name %JK' from the list"L" ( parallel job or specific stage in the parallel job can be constrained to run on a pool %set of

    processing nodes'"5" 0n case job as well as stage within the job are constrained to run on specific processing nodesthen stage will run on the node which is common to stage as well as job"

    L" resourceM resource resource!type "location# $%pools "dis&!pool!name#' resourceresource!type "value#" resourcetype can becanonicalhostname %#hich ta*es !uoted ethernetname of a node in cluster that is unconnected to /onductor node by the hight speednetwor*"' or dis* %o read



Now let's try our hand at interpreting a configuration file. Let's try the sample below.

node "node1"
fastname "SVR1"
pools ""
resource disk "C:...


between those two logical nodes. Node3, on the other hand, has its own disk and scratch disk space.

Pools - Pools allow us to associate different processing nodes based on their functions and characteristics. So if you see an entry like "node0" or other reserved node pools like "sort", "db2", etc., then it means that this node is part of the specified pool. A node will by default be associated to the default pool, which is indicated by "". Now if you look at node3 you can see that this node is associated to the sort pool. This will ensure that the sort stage will run only on nodes that are part of the sort pool.

Resource disk - This specifies the location on your server where the processing node will write all the data set files. As you might know, when Datastage creates a dataset, the file you see will not contain the actual data. The dataset file will actually point to the place where the actual data is stored. Where the dataset data is stored is specified in this line.

Resource scratchdisk - The location of temporary files created during Datastage processes, like lookups and sorts, is specified here. If the node is part of the sort pool then the scratch disk can also be made part of the sort scratch disk pool. This will ensure that the temporary files created during sort are stored only in this location. If such a pool is not specified, then Datastage determines if there are any scratch disk resources that belong to the default scratch disk pool on the nodes that sort is specified to run on. If this is the case, then this space will be used.

    Below is the sample diagram for 1 node and 4 node resource allocation:

    SAMPLE CONFIGURATION FILES

    Configuration file for a simple SMP

A basic configuration file for a single machine, two node server (2-CPU) is shown below. The file defines 2 nodes (node1 and node2) on a single dev server (an IP address might be provided as well instead of a hostname) with 3 disk resources (d1, d2 for the data and Scratch as scratch space).

The configuration file is shown below:

node "node1" fastname "dev" pool "" resource disk "...


node "node3" fastname "dev3" pool "" "n3" "s3"
resource disk "...


Sequential File stage:

The stage executes in parallel mode by default if reading multiple files, but executes sequentially if it is only reading one file.

In order to read a sequential file, datastage needs to know about the format of the file.

If you are reading a delimited file you need to specify the delimiter in the Format tab.

    Reading Fixed width File:

Double click on the sequential file stage and go to the properties tab.

    Source:

File: Give the file name including the path.

Read Method: Whether to specify filenames explicitly or use a file pattern.

    Important Options:

First Line is Column Names: If set true, the first line of a file contains column names on writing and is ignored on reading.

Keep File Partitions: Set True to partition the read data set according to the organization of the input file(s).

Reject Mode: Continue to simply discard any rejected rows; Fail to stop if any row is rejected; Output to send rejected rows down a reject link.

For fixed-width files, however, you can configure the stage to behave differently:

- You can specify that single files can be read by multiple nodes. This can improve performance on cluster systems.

- You can specify that a number of readers run on a single node. This means, for example, that a single file can be partitioned as it is read.

These two options are mutually exclusive.

Scenario 1:

Reading the file sequentially.

Scenario 2:

Read From Multiple Nodes = Yes

Once we set Read From Multiple Nodes = Yes, the stage by default executes in Parallel mode.

If you run the job with the above configuration it will abort with the following fatal error.

sff_SourceFile: The multinode option requires fixed length records. (That means you can use this option to read fixed width files only.)

In order to fix the above issue, go to the Format tab and add additional parameters as shown below.

Now the job finished successfully; see the datastage monitor below for the performance improvement compared with reading from a single node.

Scenario 3: Read a delimited file by adding the Number Of Readers Per Node option instead of the multinode option to improve the read performance; once we add this option the sequential file stage will execute in its default parallel mode.

If we are reading from and writing to fixed width files it is always good practice to add the APT_STRING_PADCHAR Datastage environment variable and assign a space character as its value; then it will pad with spaces, otherwise datastage will pad with the null value (the Datastage default padding character).

Always keep Reject Mode = Fail to make sure the datastage job will fail if we get an unexpected format from the source systems.

Sequential File Best Performance Settings/Tips

Important scenarios using the sequential file stage:

Sequential file with Duplicate Records

Splitting input files into three different files using lookup

Sequential file with Duplicate Records:

A sequential file has 8 records with one column; below are the values in the column, separated by spaces:

1 1 2 2 3 4 5 6

In a parallel job, after reading the sequential file, 2 more sequential files should be created, one with duplicate records and the other without duplicates.

File 1 records separated by space: 1 1 2 2

File 2 records separated by space: 3 4 5 6

How will you do it?

    Sol1:

1. Introduce a sort stage right next to the sequential file.

2. Select the property 'Key Change Column' in the sort stage; it will assign 1 for unique or 0 for duplicate records (or vice versa, as you wish).

3. Put a filter or transformer next to it and now you have the unique records in one link and the duplicates in the other link.

Sol2 (should check though):

First of all take the source file and connect it to a copy stage. Then, one link is connected to the aggregator stage and another link is connected to the lookup stage or join stage. In the Aggregator stage, using the count function, calculate how many times the values are repeating in the key column.

After calculating that, it is connected to the filter stage where we filter on cnt>1 (cnt is the new column for repeating rows).

Then the output from the filter is connected to the lookup stage as reference. In the lookup stage set the lookup failure condition to reject.

Then place two output links for the lookup: one collects the non-repeated values and the other collects the repeated values on the reject link.
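The same routing can be sketched outside DataStage. Below is a minimal Python sketch of the Sol2 idea (count per key, then split); the function and variable names are illustrative, not DataStage objects:

from collections import Counter

# Count occurrences per key (Aggregator), then route rows whose key
# repeats to one output and the rest to the other (Filter/Lookup).
def split_duplicates(values):
    counts = Counter(values)
    duplicates = [v for v in values if counts[v] > 1]
    uniques = [v for v in values if counts[v] == 1]
    return duplicates, uniques

dup, uniq = split_duplicates([1, 1, 2, 2, 3, 4, 5, 6])
print(dup)   # [1, 1, 2, 2]  -> file 1
print(uniq)  # [3, 4, 5, 6]  -> file 2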

Splitting input files into three different files using lookup:

Input file A contains: 1 2 3 4 5 6 7 8 9 10

Input file B contains: 6 7 8 9 10 11 12 13 14 15

Output file X contains: 1 2 3 4 5

Output file Y contains: 6 7 8 9 10

Output file Z contains: 11 12 13 14 15

Possible solution:

Change Capture stage. First, I am going to use the source as A and the reference as B; both of them are connected to the Change Capture stage. From the Change Capture stage it is connected to a filter stage and then to targets X, Y and Z. In the filter stage: keychange column = 2 goes to X {1,2,3,4,5}; keychange column = 0 goes to Y {6,7,8,9,10}; keychange column = 1 goes to Z {11,12,13,14,15}.

Solution 2: Create one px job. src file seq1 (1,2,3,4,5,6,7,8,9,10), 1st lkp seq2 (6,7,8,9,10,11,12,13,14,15) ...
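For reference, here is a minimal Python sketch of the same three-way split (only-in-A, in-both, only-in-B) that the Change Capture plus filter design produces; the names are illustrative:

def split_three_ways(a, b):
    set_a, set_b = set(a), set(b)
    only_a = sorted(set_a - set_b)   # change code 2 -> output X
    both = sorted(set_a & set_b)     # change code 0 -> output Y
    only_b = sorted(set_b - set_a)   # change code 1 -> output Z
    return only_a, both, only_b

x, y, z = split_three_ways(range(1, 11), range(6, 16))
print(x)  # [1, 2, 3, 4, 5]
print(y)  # [6, 7, 8, 9, 10]
print(z)  # [11, 12, 13, 14, 15]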


Dataset:

Inside an InfoSphere DataStage parallel job, data is moved around in data sets. These carry meta data with them, both column definitions and information about the configuration that was in effect when the data set was created. If, for example, you have a stage which limits execution to a subset of available nodes, and the data set was created by a stage using all nodes, InfoSphere DataStage can detect that the data will need repartitioning.

If required, data sets can be landed as persistent data sets, represented by a Data Set stage. This is the most efficient way of moving data between linked jobs. Persistent data sets are stored in a series of files linked by a control file (note that you should not attempt to manipulate these files using UNIX tools such as rm or mv. Always use the tools provided with InfoSphere DataStage).

There are two groups of Datasets - persistent and virtual. The first type, persistent Datasets, are marked with .ds extensions, while for the second type, virtual datasets, the .v extension is reserved. (It is important to mention that no *.v files may be visible in the Unix file system, since they exist only virtually, inhabiting RAM memory. The extension *.v itself is characteristic strictly of OSH - the Orchestrate scripting language.)

Further differences are much more significant. Primarily, persistent Datasets are stored in Unix files using the internal Datastage EE format, while virtual Datasets are never stored on disk - they exist within links, also in EE format, but in RAM memory. Finally, persistent Datasets are readable and rewriteable with the DataSet stage, while virtual Datasets may only be passed through in memory.

A data set comprises a descriptor file and a number of other files that are added as the data set grows. These files are stored on multiple disks in your system. A data set is organized in terms of partitions and segments.

Each partition of a data set is stored on a single processing node. Each data segment contains all the records written by a single job. So a segment can contain files from many partitions, and a partition has files from many segments.

Firstly, as a single Dataset contains multiple records, it is obvious that all of them must undergo the same processes and modifications. In a word, all of them must go through the same successive stages.

Secondly, it should be expected that different Datasets usually have different schemas, therefore they cannot be treated commonly.

Alias names of Datasets are:


1) Orchestrate File
2) Operating System file

And a Dataset consists of multiple files. They are:

a) Descriptor File
b) Data File
c) Control file
d) Header Files

In the Descriptor File we can see the schema details and the address of the data.

In the Data File we can see the data in native format.

And the Control and Header files reside in the Operating System.

Starting a Dataset Manager:

Choose Tools > Data Set Management; a Browse Files dialog box appears:

1. Navigate to the directory containing the data set you want to manage. By convention, data set files have the suffix .ds.

2. Select the data set you want to manage and click OK. The Data Set Viewer appears. From here you can copy or delete the chosen data set. You can also view its schema (column definitions) or the data it contains.

Transformer Stage:

Various functionalities of the Transformer Stage:

Generating surrogate key using Transformer
Transformer stage using stripwhitespaces
TRANSFORMER STAGE TO FILTER THE DATA
TRANSFORMER STAGE USING PADSTRING FUNCTION
CONCATENATE DATA USING TRANSFORMER STAGE
FIELD FUNCTION IN TRANSFORMER STAGE
TRANSFORMER STAGE WITH SIMPLE EXAMPLE
TRANSFORMER STAGE FOR DEPARTMENT WISE DATA
HOW TO CONVERT ROWS INTO COLUMNS IN DATASTAGE
SORT STAGE AND TRANSFORMER STAGE WITH SAMPLE DATA EXAMPLE
FIELD FUNCTION IN TRANSFORMER STAGE WITH EXAMPLE
RIGHT AND LEFT FUNCTIONS IN TRANSFORMER STAGE WITH EXAMPLE
SOME OTHER IMPORTANT FUNCTIONS:
How to perform aggregation using a Transformer
Date and time string functions
Null handling functions
Vector function - Transformer
Type conversion functions - Transformer
How to convert a single row into multiple rows
Data Stage Transformer Usage Guidelines

=========================================================================

Sort Stage:

SORT STAGE PROPERTIES:

SORT STAGE WITH TWO KEY VALUES


    +&3 T& 0"(#T( G"&/ ), )$ S&"T ST#G( )$ ,#T#ST#G(

    Group is are create in two ifferent ways.

    3e can create group i;s

  • 7/25/2019 Imp Datastage New

    28/158

    #n ,rag an ,rop in &utput

    Group );s will


Hash:

If we have data as:

1@pinky@!2@lin@-!5@Him@!6@emy@1!7@pom@!8@Hem@-!9@in@1!!@en@-!

Take the Job Design as:

And take a sequential file to load into the target.

That is, we can take it like this:

Seq.File ---- Aggregator ---- Seq.File

Read the data in the Seq.File.

And in the Aggregator Stage, in Properties, select Group = DeptNo

And select e_sal in Column for calculations

i.e.


And we need to get the records that occur multiple times into one target, and single records (not repeated with respect to dno) need to come to the other target.

My question:

I placed 2 seq files, one with count=1 and the other with count>1. The first seq file output was this:

dno count
10 3
20 2

The second seq file output was like this:

dno count
40 1
30 1

Instead I wanted output like this:

dno name
10 siva
10 ram
10 sam
20 tom
20 tiny

The 2nd output file should be:

dno name
30 emy
40 remo

Join Stage:

MULTIPLE JOIN STAGES TO JOIN THREE TABLES:

If we have three tables ...


    !!-@merlin@tester@-!!!1@Honathan@eeloper@!!!2@morgan@tester@-!!!5@mary@tester@-!

    softAcomA-eptAno@Aname@locAi!@eeloper@-!!-!@tester@1!!

    softAcomA1locAi@aA@aA-!@mel


You can learn more on the Join Stage with an example here.

JOIN STAGE WITHOUT COMMON KEY COLUMN:

If we like to join the tables ...


If we have source data as ...

Read and load the ...

Left Outer Join:

All the records from the left table ...

Full Outer Join:

All records and all matching records:


Lookup Stage:

    The >ookup stage is most appropriate when the reference data for all lookup stages in a )ob

    is small enough to fit into a"ailable physical memory. 7ach lookup reference requires a contiguous

    block of shared memory. If the ata Sets are larger than a"ailable memory resources$ the ?+I1 or

    07/!7 stage should be used.

    >ookup stages do not require data on the input link or reference links to be sorted. Be aware$

    though$ that large in,memory lookup tables will degrade performance because of their paging

    requirements. 7ach record of the output data set contains columns from a source record plus columns

    from all the corresponding lookup records where corresponding source and lookup records ha"e the

    same "alue for the lookup key columns. (he loo0up 0e, columns do not ha*e to ha*e the same

    names in the primar, and the re&erence lin0s.

    The optional re)ect link carries source records that do not ha"e a corresponding entry in the

    input lookup tables.

    ou can also perform a range loo0up+which compares the "alue of a source column to a range of

    "alues between two lookup table columns. If the source column "alue falls within the required range$ a

    row is passed to the output link. 5lternati"ely$ you can compare the "alue of a lookup column to a

    range of "alues between two source columns. /ange lookups must be based on column "alues$ not

    constant "alues. 0ultiple ranges are supported.
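As an illustration of what a range lookup does, here is a minimal Python sketch; the field names customer_id, trans_dt, start_dt and end_dt are taken from the example later in this section, and the function itself is not a DataStage API:

def range_lookup(source_rows, reference_rows):
    # Pass a source row to the output when its transaction date falls
    # between start_dt and end_dt of a reference row with the same key;
    # unmatched source rows can go down an optional reject link.
    output, reject = [], []
    for src in source_rows:
        matches = [ref for ref in reference_rows
                   if ref["customer_id"] == src["customer_id"]
                   and ref["start_dt"] <= src["trans_dt"] <= ref["end_dt"]]
        if matches:
            output.extend({**src, **ref} for ref in matches)
        else:
            reject.append(src)
    return output, reject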


There are some special partitioning considerations for Lookup stages. You need to ensure that the data being looked up in the lookup table is in the same partition as the input data referencing it. One way of doing this is to partition the lookup tables using the Entire method.

Lookup stage Configuration: Equal lookup


You can specify what action needs to be performed if the lookup fails.
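The four failure actions shown in the scenarios below (Continue, Fail, Drop, Reject) can be sketched in a few lines of Python; this only illustrates the behaviour and is not DataStage code:

def lookup(rows, reference, key, if_not_found="continue"):
    # reference is a dict keyed on the lookup key column
    output, reject = [], []
    for row in rows:
        ref = reference.get(row[key])
        if ref is not None:
            output.append({**row, **ref})
        elif if_not_found == "continue":
            output.append(row)       # keep the source row, lookup columns stay null
        elif if_not_found == "drop":
            pass                     # silently discard the row
        elif if_not_found == "reject":
            reject.append(row)       # requires a reject link
        elif if_not_found == "fail":
            raise RuntimeError("Failed a key lookup for record %s" % row)
    return output, reject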

Scenario 1: Continue

Choose Entire partition on the reference link.

Scenario 2: Fail

Job aborted with the following error:

stg_Lkp,0: Failed a key lookup for record, Key Values: CUSTOMER_ID: ...

Scenario 3: Drop

Scenario 4: Reject

If we select reject as the lookup failure condition then we need to add a reject link, otherwise we get a compilation error.


Range Lookup:

Business scenario: we have input data with customer id, customer name and transaction date. We have a customer dimension table with customer address information. A customer can have multiple records with different start and active dates, and we want to select the record where the incoming transaction date falls between the start and end date of the customer from the dim table.

Ex Input Data:


You need to specify "return multiple rows from the reference link", otherwise you will get the following warning in the job log. Even though we have two distinct rows based on the customer_id, start_dt and end_dt columns, datastage considers them duplicate rows based on the customer_id key only.

stg_Lkp,0: Ignoring duplicate entry; no further warnings will be issued for this table

Compile and Run the Job:

Scenario: Specify the range on the reference link:


This concludes lookup stage configuration for different scenarios.

RANGE LOOKUP WITH EXAMPLE IN DATASTAGE:

Range Lookup is used to check the range of the records from another table ...


    eAsal FDhsal

    0lick &k

    Than ,rag an ,rop the "eCuire columns into the output an click &k

    Gie %ile name to the Target %ile.

    Then 0ompile an "un the Bo< . That;s it you will get the reCuire &utput.

    $hy Entre /artton s used n #OOK)( stage

    (ntire partition has all ata across the noes So while matching?in lookup= the recors all ata shoul


Merge Stage:

The Merge stage is a processing stage. It can have any number of input links, a single output link, and the same number of reject links as there are update input links (according to the DS documentation).

The Merge stage combines a master dataset with one or more update datasets based on the key columns. The output record contains all the columns from the master record plus any additional columns from each update record that are required.

A master record and an update record will be merged only if both have the same key column values.

The data sets input to the Merge stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time.

As part of preprocessing your data for the Merge stage, you should also remove duplicate records from the master data set. If you have more than one update data set, you must remove duplicate records from the update data sets as well.

Unlike Join and Lookup stages, the Merge stage allows you to specify several reject links. You can route update link rows that fail to match a master row down a reject link that is specific for that link. You must have the same number of reject links as you have update links. The Link Ordering tab on the Stage page lets you specify which update links send rejected rows to which reject links. You can also specify whether to drop unmatched master rows, or output them on the output data link.
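The behaviour described above can be summarised with a minimal Python sketch (one reject list per update input; the function and field names are illustrative only):

def merge(master, updates, key):
    # master and updates are lists of dict rows, already de-duplicated and sorted on key
    merged = {m[key]: dict(m) for m in master}
    rejects = [[] for _ in updates]              # one reject output per update input
    for i, update in enumerate(updates):
        for row in update:
            if row[key] in merged:
                merged[row[key]].update(row)     # add the update columns to the master record
            else:
                rejects[i].append(row)           # unmatched update rows go to that link's reject
    return list(merged.values()), rejects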

Example:

Master dataset:

Options:

Compile and run the job:

Scenario: Remove a record from the update ds1 and check the output:

Check for the datastage warning in the job log, as we have selected Warn On Unmatched Masters = True. Look at the output and it is clear that the merge stage automatically dropped the duplicate record from the master dataset.

Scenario: Added a new update dataset which contains the following data.


Scenario: add a duplicate row for customer_id=1 in the update ds1 dataset.

Now we have a duplicate record both in the master dataset and in update ds1. Run the job and check the results and warnings in the job log.

No change in the results; the merge stage automatically dropped the duplicate row.

Scenario: modify the duplicate row for customer_id=1 in the update ds1 dataset with a different zipcode.

Run the job and check the output results.


I ran the same job multiple times and found the merge stage is taking the first record coming as input from update ds1 and dropping the next records with the same customer id.

This post covered most of the merge scenarios.

=========================================================================

Filter Stage:

The Filter stage is a processing stage used to filter data based on a filter condition.

The filter stage is configured by creating an expression in the where clause.

Scenario 1: Check for empty values in the customer name field. We are reading from a sequential file and hence we should check for an empty value instead of null.
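A minimal Python sketch of this first scenario (keep rows with a non-empty customer_name, route the rest to a reject output; the names are illustrative):

def filter_customers(rows):
    output = [r for r in rows if r["customer_name"] != ""]
    reject = [r for r in rows if r["customer_name"] == ""]
    return output, reject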


Scenario 2: Comparing incoming fields. Check that the transaction date falls between strt_dt and end_dt and filter those records.

Input Data:


Actual Output:

Actual Reject Data:

Scenario 3: Evaluating input column data.

Ex: Where ...


Reject:

This covers most filter stage scenarios.

FILTER STAGE WITH REAL TIME EXAMPLE:

Filter Stage is used to write conditions on Columns.

We can write Conditions on any number of columns ...


We can take a Sequential file to read the data and a filter stage for writing the Conditions.

And a Dataset file to load the data into the Target.

Design as follows:

Seq.File ---- Filter ---- Dataset File

Open the Sequential File and read the data.

In the filter stage Properties, write the Condition in the Where clause, for example

e_sal=2200

Go to Output, Drag and Drop.

Click Ok.

Go to the Target Dataset file, give some name to the file, and that's it.

Compile and Run.

You will get the required output in the Target file.

If you are trying to write conditions on multiple columns,

write the condition in the where clause

and give the output like (Link order num ...


Copy Stage:

COPY STAGE:

Copy Stage is one of the processing stages that has one input and 'n' number of outputs ...


Funnel Stage:

It operates in 3 modes:

Continuous Funnel combines records as they arrive (i.e. no particular order);

Sort Funnel combines the input records in the order defined by one or more key fields;

Sequence copies all records from the first input data set to the output data set, then all the records from the second input data set, and so on.

Note: Metadata for all inputs must be identical.

Sort funnel requires that the data be sorted and partitioned by the same key columns as are used by the funnel operation.

Hash Partition guarantees that all records with the same key column values are located in the same partition and are processed on the same node.
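A minimal Python sketch of the three modes, assuming each input is a plain list of records (sort_funnel additionally assumes each input is already sorted on the key, as the stage requires):

from itertools import chain, zip_longest
import heapq

_SKIP = object()

def continuous_funnel(inputs):
    # interleave records as they "arrive"; no particular order is guaranteed
    return [r for group in zip_longest(*inputs, fillvalue=_SKIP)
            for r in group if r is not _SKIP]

def sequence_funnel(inputs):
    # all records of input 1, then input 2, and so on
    return list(chain.from_iterable(inputs))

def sort_funnel(inputs, key):
    # merge the pre-sorted inputs so the output stays ordered on the key
    return list(heapq.merge(*inputs, key=key))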


1) Continuous funnel:

Go to the properties page of the funnel stage and set Funnel Type to Continuous Funnel.


2) Sequence:


Note: In order to use the sequence funnel you need to specify the order in which the input links are processed and also make sure the stage runs in sequential mode.


    Note: I& ,ou are running ,our sort &unnel stage in parallel+ ,ou should -e aware o& the

    *arious

    considerations a-out sorting data and partitions

    Thats all about funnel stage usage in datastage.

    !)NNE# STAGE $"T% REA# T"'E E&A'(#E

    Some times we get ata in multiple files which


For Funnel take the Job design as:

Read and load the data into two sequential files.

Go to the Funnel stage Properties and

select Funnel Type = Continuous Funnel

(Or any other according to your requirement)

Go to output, Drag and drop the Columns

(Remember ...


Column Generator Stage:

In order to generate a column (for ex: unique_id):

First read and load the data in the seq.file.

Go to the Column Generator stage Properties and select the column method as explicit.

In "column to generate" give the column name (for ex: unique_id).

In Output drag and drop.

Go to the column, write the column name; you can change the data type for unique_id in SQL type and can give the length with a suitable ...


... n records loaded. By using a surrogate key you can continue the sequence from n+1.

Surrogate Key Generator:

The Surrogate Key Generator stage is a processing stage that generates surrogate key columns and maintains the key source.

A surrogate key is a unique primary key that is not derived from the data that it represents, therefore changes to the data will not change the primary key. In a star schema database, surrogate keys are used to join a fact table to a dimension table.

Surrogate key generator stage uses:

Create or delete the key source before other jobs run
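The flat-file key source behaves like a small state file that remembers the last key handed out. A minimal Python sketch of that idea (the path and file format here are illustrative, not the stage's real state-file format):

import os

def next_keys(state_file, how_many):
    # read the last highest value, hand out the next block of keys,
    # then write the new highest value back to the state file
    last = 0
    if os.path.exists(state_file):
        with open(state_file) as fh:
            last = int(fh.read().strip() or 0)
    keys = list(range(last + 1, last + 1 + how_many))
    with open(state_file, "w") as fh:
        fh.write(str(keys[-1]))
    return keys

print(next_keys("/tmp/skey_demo.stat", 10))   # first run: [1, 2, ..., 10]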


Key Source Action = Create

Source Type: FlatFile or Database sequence (in this case we are using FlatFile)

When you run the job it will create an empty file.

If you want to check the content, change View State File = YES and check the job log for details.

skey_genstage,0: State file /tmp/skey_cutomerdim.stat is empty.

If you try to create the same file again, the job will abort with the following error.

skey_genstage,0: Unable to create state file /tmp/skey_cutomerdim.stat: File exists.

Deleting the key source:

Updating the state file:

To update the state file, add a surrogate key stage to the job with a single input link from another stage.

We use this process to update the state file if it is corrupted or deleted.

1) Open the surrogate key stage editor and go to the properties tab.


If the state file exists we can update it, otherwise we can create and update it.

We are using the SkeyValue parameter to update the state file using a transformer stage.


Go to output and define the mapping like below.

In Rowgen we are using 10 rows, and hence when we run the job we see 10 skey values in the output.

I have updated the state file with 100 and below is the output.


If you want to generate the key value from the beginning you can use the following properties in the surrogate key stage.

a. If the key source is a flat file, specify how keys are generated:

o To generate keys in sequence from the highest value that was last used, set the Generate Key From Last Highest Value property to Yes. Any gaps in the key range are ignored.

o To specify a value to initialize the key source, add the File Initial Value property to the Options group, and specify the start value for key generation.

o To control the block size for key ranges, add the File Block Size property to the Options group, and set this property to ...


Slowly Changing Dimension (SCD):

They are:

Type 1 SCD
Type 2 SCD
Type 3 SCD

Type 1 SCD: In the Type 1 SCD methodology, it overwrites the older data (records) with the new data (records) and therefore it does not maintain historical information.

This is used for correcting the spellings of names, and for small updates of customers.

Type 2 SCD: In the Type 2 SCD methodology, it tracks the complete historical information.

HOW TO USE TYPE-2 SCD IN DATASTAGE

SCDs are nothing but ...


    surrogate_key customer_id customer_name Location

    ------------------------------------------------

    1 1 Marspton Illions

Here the customer name is misspelt. It should ...


Now again, if the customer moves from Seattle to NewYork, then the update table ...


1 1 Marston Illions 1
2 1 Marston Seattle 2

Now again, if the customer moves to another location, a new record will ...


2 1 Marston Seattle 21-Feb-2011 NULL

The NULL in the End_Date indicates the current version of the data and the remaining records indicate the past data.
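A minimal Python sketch of this Type 2 behaviour (close the current row by setting its end date, then insert a new current row with a fresh surrogate key; column names follow the example above):

from datetime import date

def apply_scd2(dimension, customer_id, new_location, change_date=None):
    change_date = change_date or date.today()
    next_key = max((r["surrogate_key"] for r in dimension), default=0) + 1
    for row in dimension:
        if row["customer_id"] == customer_id and row["end_date"] is None:
            if row["location"] == new_location:
                return dimension                 # nothing changed, keep current row
            row["end_date"] = change_date        # close the current version
    dimension.append({"surrogate_key": next_key,
                      "customer_id": customer_id,
                      "location": new_location,
                      "start_date": change_date,
                      "end_date": None})         # NULL end date marks the current version
    return dimension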

SCD-2 Implementation in Datastage:

Slowly changing dimension Type 2 is a model where the whole history is stored in the database. An additional dimension record is created, the segmenting between the old record values and the new (current) value is easy to extract, and the history is clear. The fields "effective date" and "current indicator" are very often used in that dimension, and the fact table usually stores the dimension key and the version number.

SCD 2 implementation in Datastage: the job described and depicted below shows how to implement SCD Type 2 in Datastage. It is one of many possible designs which can implement this dimension. For this example, we will use a table with customers data (its name is ...).

A lookup transformer does a lookup into a hashed file and maps new and old values to separate columns (SCD lookup transformer).

A Check_discrepancies_exist transformer compares the old and new values of records and passes through only records that differ (SCD check discrepancies transformer).

Another transformer handles the ...


Pivot Enterprise stage is a processing stage which pivots data vertically and horizontally depending upon the requirements. There are two types:

1. Horizontal
2. Vertical

A Horizontal Pivot operation sets input columns to multiple rows, which is exactly the opposite of the Vertical Pivot operation, which sets input rows to multiple columns.

Let's try to understand them one by one with the following example.

1. Horizontal Pivot Operation.

Consider the following table.

Product_Type  Color_1  Color_2  Color_3
Pen           Yellow   Blue     Green
Dress         Pink     Yellow   Purple

Step 1: Design our job structure like below.

Configure the above table with the input sequential stage 'se_product_clr_det'.

Step 2: Let's configure the 'Pivot Enterprise stage'. Double click on it. The following window will pop up.


Select 'Horizontal' for Pivot Type from the drop-down menu under the Properties tab for the horizontal Pivot operation.

Step 3: Click on the 'Pivot Properties' tab.


Step 4: Now we have to mention the columns to be pivoted under Derivation against the column Color.

Double click on it. The following window will pop up.

Select the columns to be pivoted from the 'Available columns' pane as shown. Click OK.

Step 5:


Configure the output stage. Give the file path. See the below image for reference.

Step 6: Compile and Run the job. Let's see what happens to the output.


This is how we can set multiple input columns to a single column (as here for colors).
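Outside DataStage, a quick awk sketch shows the same horizontal pivot idea; it assumes a comma-delimited input file se_product_clr_det.csv with a header row and the columns Product_Type, Color_1, Color_2, Color_3 (the file name and delimiter are just assumptions for this illustration, not part of the stage itself):

# turn the three color columns of each row into one (product, color) row each
awk -F',' 'NR > 1 { for (i = 2; i <= NF; i++) print $1 "," $i }' se_product_clr_det.csv
# Pen,Yellow
# Pen,Blue
# Pen,Green
# Dress,Pink
# Dress,Yellow
# Dress,Purple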

Vertical Pivot Operation:

Here, we are going to use the Pivot Enterprise stage to vertically pivot data. We are going to set multiple input rows to a single row. The main advantage of this stage is that we can use aggregation functions like avg, sum, min, max, first, last etc. for the pivoted column. Let's see how it works.

Consider the output data of the Horizontal operation as input data for the Pivot Enterprise stage. Here, we will be adding one extra column for the aggregation function, as shown in the below table.

Product  Color   Prize
Pen      Yellow  38
Pen      Blue    43
Pen      Green   25
Dress    Pink    1000
Dress    Yellow  695
Dress    purple  738

Let's study the vertical pivot operation step by step.

Step 1: Design your job structure like below. Configure the above table data with the input sequential file se_product_det.

Step 2: Open the Pivot Enterprise stage and select Pivot type as vertical under the properties tab.


Step 3: Let's see how to use 'Aggregation functions' in the next step.

Step 4: On clicking 'Aggregation functions required for this column' for a particular column, the following window will pop up, in which we can select whichever functions are required for that particular column.

Here we are using 'min', 'max' and 'average' functions with proper precision and scale for the Prize column, as shown.


Step 5: Now we just have to do the mapping under the output tab as shown below.

Step 6: Compile and Run the job. Let's see what the output will be.

Output:
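As a rough command-line equivalent of this vertical pivot with aggregation (only a sketch, assuming a comma-delimited, headerless file se_product_det.csv with the columns Product, Color, Prize), the following awk groups the rows by product, concatenates the colors and computes min, max and average of the Prize column:

awk -F',' '{
    colors[$1] = (colors[$1] == "" ? $2 : colors[$1] "," $2)   # pivoted color list per product
    sum[$1] += $3; cnt[$1]++                                   # running totals for the average
    if (!($1 in min) || $3 < min[$1]) min[$1] = $3
    if (!($1 in max) || $3 > max[$1]) max[$1] = $3
} END {
    for (p in colors)
        printf "%s,%s,%s,%s,%.2f\n", p, colors[p], min[p], max[p], sum[p] / cnt[p]
}' se_product_det.csv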


One more approach:

Many people have the following misconceptions about the Pivot stage:

1) It converts rows into columns
2) By using a pivot stage, we can convert 10 rows into 100 columns and 100 columns into 10 rows
3) You can add more points here...

Let me first tell you that a Pivot stage only CONVERTS COLUMNS INTO ROWS and nothing else. Some DS professionals refer to this as NORMALIZATION. Another fact about the Pivot stage is that it's irreplaceable, i.e. no other stage has this functionality of converting columns into rows. So, that makes it unique, doesn't it...

Let's cover how exactly it does it....

For example, let's take a file with the following fields: Item, Quantity1, Quantity2, Quantity3....

Item


Now connect a Pivot stage from the Tool palette to the above output link and create an output link for the Pivot stage itself (for enabling the Output tab for the pivot stage).

Unlike other stages, a pivot stage doesn't use the generic GUI stage page. It has a stage page of its own. And by default the Output columns page would not have any fields. Hence, you need to manually type in the fields. In this case just type in the 2 field names: Item and Quantity. However, manual typing of the columns becomes a tedious process when the number of fields is more. In this case you can use the Metadata Save - Load feature. Go to the input columns tab of the pivot stage, save the table definitions and load them in the output columns tab. This is the way I use it!

Now, you have the following fields in the Output Columns tab... Item and Quantity.... Here comes the tricky part, i.e. you need to specify the DERIVATION.... In case the field names of the Output columns tab are the same as the Input tab, you need not specify any derivation, i.e. in this case for the Item field you need not specify any derivation. But if the Output columns tab has new field names, you need to specify a Derivation or you would get a RUN-TIME error for free....

For our example, you need to type the Derivation for the Quantity field as:

Column name   Derivation
Item          Item (or you can leave this blank)
Quantity      Quantity1, Quantity2, Quantity3

Just attach another file stage and view your output! So, objective met!

Sequence/Activities:

In this article I will explain how to use DataStage looping activities in a sequencer.

I have a requirement where I need to pass a file id as a parameter, reading it from a file. In future, file ids will increase, so I don't have to add a job or change the sequencer if I take advantage of DataStage looping.

Contents in the File:

    =Q88


    Q@88

    @QA88

I need to read the above file and pass the second field as a parameter to the job. I have created one parallel job with pFileID as a parameter.

Step 1: Count the number of lines in the file so that we can set the upper limit in the DataStage StartLoop activity.

Sample routine to count lines in a file:

Argument: FileName (including path)

Deffun DSRMessage(A1, A2, A3) Calling "*DataStage*DSR_MESSAGE"
Equate RoutineName To "CountLines"
Command = "wc -l ":FileName:" | awk '{print $1}'"

Call DSLogInfo("Executing Command To Get the Record Count ":Command, RoutineName)

* call support routine that executes a Shell command.
Call DSExecute("UNIX", Command, Output, SystemReturnCode)

If SystemReturnCode = 0 Then
   Call DSLogInfo("Here is the Record Count In ":FileName:" = ":Output, Output)
   Ans = Output
   GoTo NormalExit
End Else
   Call DSLogInfo("Error when executing command ":Command, RoutineName)
   Call DSLogFatal(Output, RoutineName)
   Ans = 1
End
NormalExit:


Now we use the StartLoop_Activity.$Counter variable to get the file id by using a combination of grep and awk commands.

For each iteration it will get the file id (a command-line sketch of this follows below).
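For example, the Execute Command activity inside the loop could fetch the id for the current iteration with something like the sketch below; the file name file_ids.txt and the 'sequence-number,file-id' layout are assumptions based on the sample contents above, and N stands for the StartLoop counter value:

# pick the Nth line of the list and return its second comma-separated field
N=2
grep -n "" file_ids.txt | grep "^${N}:" | awk -F',' '{print $2}'
# or, more simply:
sed -n "${N} p" file_ids.txt | cut -d',' -f2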


Finally the sequence job looks like below.

I hope everyone likes this post.

=================================================================

TRANSFORMER STAGE TO FILTER


    -@tom@eeloper@-!1@Him@clerck@!2@on@tester@1!5@Ieera@eeloper@-!6@arun@clerck@!7@luti@prouction@2!8@raHa@priuction@2!

And our requirement is to get the target data as



Difference between Join, Lookup and Merge:


Datastage Scenarios and solutions:

Field mapping using Transformer stage:

Requirement: the field will be right justified, zero filled; take the last 18 characters.

Solution: Right("0000000000" : Trim(Lnk_Xfm_Trans.link), 18)
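The same right-justify / zero-fill idea can be checked quickly at the command line; this is only a sketch (the value 12345 and the width of 18 are made up):

# numeric values: pad on the left with zeros to a width of 18
printf "%018d\n" 12345
# general strings: emulate Right("000000000000000000" : Trim(val), 18)
v="12345"
echo "000000000000000000${v}" | awk '{ print substr($0, length($0) - 17) }'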

Scenario 1:

We have two datasets with 4 cols each, with different names. We should create a dataset with 4 cols: 3 from one dataset and one col with the record count of one dataset.

We can use an aggregator with a dummy column and get the count from one dataset, and do a lookup from the other dataset and map it to the 3rd dataset.


Something similar to the below design:

Scenario 2:

Following is the existing job design. But the requirement got changed to: header and trailer datasets should populate even if detail records are not present in the source file. The below job doesn't do that.

Hence changed the above job to the following to meet the requirement:


Used a row generator with a copy stage. Given a default value (zero) for the col (count) coming in from the row generator. If there are no detail records it will pick the record count from the row generator.

We have a source which is a sequential file with header and footer. How to remove the header and footer while reading this file using the sequential file stage of Datastage?
Sol: Type the command in putty: sed '1d;$d' file_name > new_file_name, run this in the job's before-job subroutine, then use the new file in the sequential file stage.
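A minimal sketch of that before-job command (file names are placeholders):

# drop the first line (header) and the last line (footer)
sed '1d;$d' source_file.txt > detail_only.txt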

IF I HAVE SOURCE LIKE
COL1
A
A
B
AND TARGET LIKE
COL1 COL2
A    1
A    2
B    1
HOW TO ACHIEVE THIS OUTPUT USING A STAGE VARIABLE IN THE TRANSFORMER STAGE?

If keyChange = 1 Then 1 Else stagevariable + 1
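The same key-change / stage-variable logic can be sanity-checked with a small awk sketch that numbers every occurrence of a repeating key (assuming COL1 is the first field of a comma-delimited file; names are illustrative):

# print COL1 together with its running occurrence number
awk -F',' '{ seen[$1]++; print $1 "," seen[$1] }' input.csv
# A -> A,1    A -> A,2    B -> B,1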

Suppose that 4 jobs are controlled by a sequencer, like (job 1, job 2, job 3, job 4). If job 1 has 10,000 rows and after running the job only 5,000 rows have been loaded into the target table, the remaining are not loaded and your job is going to be aborted. How can you sort out the problem?

Suppose the job sequencer synchronises or controls the 4 jobs but job 1 has a problem. In this condition you should go to the Director and check what type of problem it is showing: either a data type problem, a warning message, job fail or job aborted. If the job fails it means a data type problem or a missing column action. So you should go to the Run window -> Click -> Tracing -> Performance, or in your target table -> General -> Action -> select this option; here there are two options: (i) On Fail -- Commit, Continue (ii) On Skip -- Commit, Continue. First check how much data is already loaded, then select the On Skip option and continue; for the remaining portion of data not loaded, select On Fail, Continue...... Again run the job; definitely you will get a successful message.

Question: I want to process 3 files sequentially one by one. How can I do that? While processing the files it should fetch the files automatically.


Ans: If the metadata for all the files is the same, then create a job having the file name as a parameter, then use the same job in a routine and call the job with a different file name... or you can create a sequencer to use the job.

Parameterize the file name.
Build the job using that parameter.
Build a job sequencer which will call this job and will accept the parameter for the file name.
Write a UNIX shell script which will call the job sequencer three times by passing a different file each time.

RE: What happens if RCP is disabled?

In such a case OSH has to perform import and export every time the job runs, and the processing time of the job is also increased...

Runtime column propagation (RCP): If RCP is enabled for any job, and specifically for those stages whose output connects to the shared container input, then metadata will be propagated at run time, so there is no need to map it at design time.

If RCP is disabled for the job, in such a case OSH has to perform import and export every time the job runs and the processing time of the job is also increased. Then you have to manually enter all the column descriptions in each stage. RCP - Runtime column propagation.

Question:
Source:          Target:
Eno  Ename       Eno  Ename
1    a,b         1    a
2    c,d         2    b
3    e,f         3    c

source has 2 fields like

COMPANY   LOCATION
IBM       HYD
TCS       BAN
IBM       CHE
HCL       HYD
TCS       CHE
IBM       BAN
HCL       BAN
HCL       CHE

LIKE THIS.......

THEN THE OUTPUT LOOKS LIKE THIS....


Company  loc           count
TCS      HYD,BAN,CHE   3
IBM      HYD,BAN,CHE   3
HCL      HYD,BAN,CHE   3

2) Input is like this:
no,char
1,a
2,b
3,a
4,b
5,a
6,a
7,b
8,a

But the output is in this form, with row numbering of each duplicate occurrence:

output:
no,char,Count
"1","a","1"
"6","a","2"
"5","a","3"
"8","a","4"
"3","a","5"
"2","b","1"
"7","b","2"
"4","b","3"

3) Input is like this:
file1
10
20
10
10
20


30

Output is like:
file2    file3 (duplicates)
10       10
20       10
30       20

4) Input is like:
file1
10
20
10
10
20
30
Output is like (multiple occurrences in one file and single occurrences in another file):
file2    file3
10       30
10
10
20
20

5) Input is like this:
file1
10
20
10
10
20
30
Output is like:
file2    file3
10       30
20

6) Input is like this:
file1
1
2
3
4


5
6
7
8
9
10

Output is like:
file2 (odd)    file3 (even)
1              2
3              4
5              6
7              8
9              10

7) How to calculate Sum(sal), Avg(sal), Min(sal), Max(sal) without using the Aggregator stage?

8) How to find out First sal, Last sal in each dept without using the Aggregator stage?

9) How many ways are there to perform the remove-duplicates function without using the Remove Duplicates stage?

Scenario:

source has 2 fields like

COMPANY   LOCATION
IBM       HYD
TCS       BAN
IBM       CHE
HCL       HYD
TCS       CHE
IBM       BAN
HCL       BAN
HCL       CHE

LIKE THIS.......

THEN THE OUTPUT LOOKS LIKE THIS....

Company  loc           count
TCS      HYD,BAN,CHE   3


IBM      HYD,BAN,CHE   3
HCL      HYD,BAN,CHE   3

Solution:

SeqFile......>Sort......>Trans......>RemoveDuplicates......>Dataset

Sort:
Key = Company, Sort order = Asc, Create key change = True
Trans:
Create a stage variable Company1:
Company1 = If (in.keychange = 1) then in.Location Else Company1 : "," : in.Location
Drag and drop in the derivation:

Company .................... Company
Company1 .................... Location

RemoveDup:
Key = Company
Duplicates To Retain = Last
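Logically this is a group-wise concatenation; just to illustrate the expected result outside DataStage, a small awk sketch over 'COMPANY LOCATION' pairs (one per line, whitespace separated; file name assumed) would be:

awk '{
    loc[$1] = (loc[$1] == "" ? $2 : loc[$1] "," $2)   # collect locations per company
    cnt[$1]++                                         # count rows per company
} END {
    for (c in loc) print c, loc[c], cnt[c]
}' company_location.txt
# e.g. IBM HYD,CHE,BAN 3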

11) The input is:
Shirt|red|blue|green
Pant|pink|red|blue

Output should be:

Shirt:red
Shirt:blue
Shirt:green
pant:pink
pant:red
pant:blue

Solution:

it is the reverse of the pivot stage

use seq------sort------tr----rd-----tr----tgt

in the sort stage use create key change column = true

in the transformer create a stage variable: if the key change column = 1 then the column value, else stagevariable :: column value
in the rd stage use duplicates retain last

in the second transformer use the Field function to separate the columns

similar Scenario:


source
col1  col2
1     samsung
1     nokia
1     ericsson
2     iphone
2     motorola
3     lava
3     blackberry
3     reliance

Expected Output

col1  col2     col3        col4
1     samsung  nokia       ericsson
2     iphone   motorola
3     lava     blackberry  reliance

You can get it by using Sort stage --- Transformer stage --- RemoveDuplicates --- Transformer --- tgt

Ok

First Read and Load the data into your source file (For Example a Sequential File)

And in the Sort stage select key change column = True (To Generate Group ids)

Go to the Transformer stage

Create one stage variable.

You can do this by right click in stage variable, go to properties and name it as you wish (For example: temp)

and in the expression write as below:

if keychange column = 1 then column name else temp : "," : column name

This column name is the one you want in the required column with delimited commas.

On the remove duplicates stage the key is col1; set the option duplicates retain to --> Last.

In the transformer drop col2 and define 3 columns like col2, col3, col4:
in col2 derivation give Field(InputColumn, ",", 1) and
in col3 derivation give Field(InputColumn, ",", 2) and
in col4 derivation give Field(InputColumn, ",", 3)

Scenario: 1) Consider the following employees data as source

employee_id, salary
-------------------
10, 1000
20, 2000


30, 3000
40, 5000

Create a job to find the sum of salaries of all employees, and this sum should repeat for all the rows.

The output should look like:

employee_id, salary, salary_sum
-------------------------------
10, 1000, 11000
20, 2000, 11000
30, 3000, 11000
40, 5000, 11000

Scenario:

I have two source tables/files numbered 1 and 2. In the target, there are three output tables/files, numbered 3, 4 and 5.

The scenario is that:

to the output 4 --> the records which are common to both 1 and 2 should go.

to the output 3 --> the records which are only in 1 but not in 2 should go.

to the output 5 --> the records which are only in 2 but not in 1 should go.

sltn:
src1----->copy1------>----------------------------------->output_1 (only left table)
                       Join (inner type)----------------> output_2
src2----->copy2------>----------------------------------->output_3 (only right table)
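Outside DataStage the same three-way split can be illustrated with sort and comm on the key files (file names are placeholders; comm expects sorted input):

sort src1.txt > s1.sorted
sort src2.txt > s2.sorted
comm -23 s1.sorted s2.sorted > output_only_in_1.txt   # records only in source 1
comm -13 s1.sorted s2.sorted > output_only_in_2.txt   # records only in source 2
comm -12 s1.sorted s2.sorted > output_common.txt      # records common to both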

Consider the following employees data as source

employee_id, salary
-------------------
10, 1000
20, 2000
30, 3000
40, 5000

Scenario:

Create a job to find the sum of salaries of all employees, and this sum should repeat for all the rows.

The output should look like:

employee_id, salary, salary_sum
-------------------------------
10, 1000, 11000
20, 2000, 11000


30, 3000, 11000
40, 5000, 11000

sltn:

Take Source --->Transformer (Add a new column on both the output links and assign a value of 1) ---> 1) Aggregator (do group by using that new column) 2) lookup/join (join on that new column) -------->tgt.
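A quick awk sketch of the same idea, reading the file twice: the first pass totals the salaries, the second pass appends the total to every row (assuming a headerless comma-delimited employees.csv; names are placeholders):

awk -F',' 'NR == FNR { total += $2; next } { print $0 "," total }' employees.csv employees.csv
# 10,1000,11000
# 20,2000,11000
# 30,3000,11000
# 40,5000,11000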

Scenario:

sno,sname,mark1,mark2,mark3
1,rajesh,70,65,78
2,mamatha,38,45,73
3,anjali,67,38,72
4,pavani,85,65,45
5,indu,65,67,72

output is

sno,sname,mark1,mark2,mark3,delimetercount
1,rajesh,70,65,78,4
2,mamatha,38,45,73,4
3,anjali,67,38,72,4
4,pavani,85,65,45,4
5,indu,65,67,72,4

seq--->trans--->seq

create one stage variable as delimiter..

and put the derivation on the stage variable as: SLink4.sno : "," : SLink4.sname : "," : SLink4.mark1 : "," : SLink4.mark2 : "," : SLink4.mark3

and do the mapping and create one more column count as integer type,

and put the derivation on the count column as Count(delimiter, ",")
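The equivalent check from the command line: the delimiter count is just the number of fields minus one (comma-delimited file assumed):

awk -F',' '{ print $0 "," NF - 1 }' students.csv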

scenario:

sname   total_vowels_count
Allen   2
Scott   1
Ward    1

total_vowels_count = Count(sname,"a") + Count(sname,"e") + Count(sname,"i") + Count(sname,"o") + Count(sname,"u")
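A small awk sketch of the same vowel count (the first field holds the name; gsub works on a copy and returns how many characters it replaced):

awk '{ name = $1; n = gsub(/[aeiouAEIOU]/, "", name); print $1, n }' names.txt
# Allen 2
# Scott 1
# Ward 1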

    Scenario:


1) On a daily basis we are getting some huge files of data; all the files' metadata is the same; we have to load them into the target table. How can we load?

DaysSinceFromDate(CurrentDate(), SLink.date_1) <= 365 OR
DaysSinceFromDate(CurrentDate(), SLink.date_1) <= 366

where the date_1 column is the column having the date which needs to be less than or equal to 12 months old, and 365 is the no. of days for 12 months; for a leap year it is 366 (these numbers you need to check).

What is the difference between Force Compile and Compile?

Diff b/w Compile and Validate?

The Compile option only checks for all mandatory requirements like link requirements, stage options and all. But it will not check if the database connections are valid. Validate is equivalent to running a job except for extraction/loading of data. That is, the validate option will test database connectivity by making connections to the databases.

How to Find Out Duplicate Values Using Transformer?

You can capture the duplicate records


My source is like:

Srno, Name
10,a
10,b
20,c
30,d
30,e
40,f

My target should be like:

Target 1: (Only unique, means records which occur only once)
20,c
40,f

Target 2: (Records which have more than 1 occurrence)
10,a
10,b
30,d
30,e

How to do this in DataStage?

==============

use aggregator and transformer stages

source--aggregator--transformer--target

perform count in the aggregator, and take two o/p links in the transformer; filter data with count=1 for one link

and put count>1 for the second link.
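The same split can be sketched with awk outside DataStage, reading the comma-delimited file twice (file names are placeholders):

# first pass counts each key, second pass routes rows to the two targets
awk -F',' 'NR == FNR { cnt[$1]++; next }
           { print > (cnt[$1] == 1 ? "target1_unique.txt" : "target2_dups.txt") }' src.csv src.csv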

Scenario:

in my i/p


=================

source---trans----target

in trans use conditions on the constraints:

mod(empno,3) = 1

mod(empno,3) = 2

mod(empno,3) = 0
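The same mod-based routing, sketched in awk just for illustration (empno assumed to be the first comma-separated field; file names are placeholders):

awk -F',' '{
    r = $1 % 3
    if (r == 1)      print > "target1.txt"
    else if (r == 2) print > "target2.txt"
    else             print > "target0.txt"
}' emp.csv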

Scenario:

I'm having i/p



How to Find Out Duplicate Values Using Transformer?

Another way to find the duplicate values can be using a Sorter stage before the Transformer.

In the sorter: make Cluster Key Change = TRUE on the key; then in the Transformer filter the output on the basis of the value of the cluster key change column, which can be put in a stage variable.

Scenarios_Unix:

1) Convert single column to single row:
Input: filename: try

    R)?R$DR$D?ABA443?CA74DR?4DRD43R;43SB3?4DAR3R

    R$$E$77$?A44RA7

    Output:R)?R$D R$D?AB A443?CA7 4DR?4D RD43 R;43SB3?4D AR3R R$$ E$77$?A44RA7

Command: cat try | awk '{printf "%s ",$1}'

2) Print the list of employees in the Technology department:

Now the department name is available as a fourth field, so we need to check if $4 matches the string "Technology"; if yes, print the line.

Command: $ awk '$4 ~/Technology/' employee.txt

200  Jason   Developer  Technology  $5,500
300  Sanjay  Sysadmin   Technology  $7,000
500  Randy   DBA        Technology  $6,000


The operator ~ is for comparing with the regular expressions. If it matches, the default action, i.e. print the whole line, will be performed.

3) Convert single column to multiple columns:

For eg: the input file contains a single column with 96 rows; then the output should be the single-column data converted to multiples of 12 columns, i.e. 12 columns x 8 rows, with field separator fs = ';'.

Script:

#!/bin/sh
# convert a single-column input_file into rows of 12 fields separated by ';'
rows=`cat input_file | wc -l`
cols=12
fs=';'
awk -v r=$rows -v c=$cols -v t="$fs" '
{
    printf "%s", $0
    if (NR % c == 0) printf "\n"; else printf "%s", t
}
END { if (NR % c != 0) printf "\n" }
' input_file > output_file

4) Last field print:

input: a=/Data/files/20102011.csv
output: 20102011.csv

Command: echo $a | awk -F'/' '{print $NF}'

5) Count no. of fields in file1:
file1: a, b, c, d, 1, 2, man, fruit
Command: cat file1 | awk 'BEGIN {FS=","} {print NF}'

and you will get the output as: 8

6) Find ip address in unix server:
Command: grep -i your_hostname /etc/hosts

7) Replace the word corresponding to the search pattern:


    >cat file

    the black cat was chased by the brown dog.

    the black cat was not chased by the brown dog.

    >sed -e '/not/s/black/white/g' file

    the black cat was chased by the brown dog.

    the white cat was not chased by the brown dog.

8) Below I have shown the demo for the ASCII and character conversions.

Ascii value of a character: it can be done in 2 ways:

1. printf "%d" "'A"
2. echo "A" | tr -d "\n" | od -An -t d1

Character value from Ascii: awk -v char=65 'BEGIN { printf "%c\n", char; exit }'

9) Input file:

crmplp1 cmis941 online
cmis942 offline
crmplp2 cmis942 online
cmis943 offline
crmplp3 cmis943 online
cmis941 offline

Output:
crmplp1 cmis941 online cmis942 offline
crmplp2 cmis942 online cmis943 offline
crmplp3 cmis943 online cmis941 offline

Command: awk 'ORS=NR%2?FS:RS' file

10) A variable can be used in AWK:

awk -F"," -v var="," '{print $1 var $2}' filename

11) Search pattern and use a special character in the sed command:


sed -e '


1. How to display the 10th line of a file?
head -10 filename | tail -1

2. How to remove the header from a file?
sed -i '1 d' filename

3. How to remove the footer from a file?
sed -i '$ d' filename

4. Write a command to find the length of a line in a file?
The below command can be used to get a line from a file.
sed -n '<n> p' filename
We will see how to find the length of the 10th line in a file:
sed -n '10 p' filename | wc -c

5. How to get the nth word of a line in Unix?
cut -f<n> -d' '

6. How to reverse a string in unix?
echo "java" | rev

7. How to get the last word from a line in a Unix file?
echo "unix is good" | rev | cut -f1 -d' ' | rev

8. How to replace the n-th line in a file with a new line in Unix?
sed -i'' '10 d' filename                       # d stands for delete
sed -i'' '10 i new inserted line' filename     # i stands for insert

9. How to check if the last command was successful in Unix?
echo $?

10. Write a command to list all the links from a directory?
ls -lrt | grep "^l"

11. How will you find which operating system your system is running on in UNIX?
uname -a

12. Create a read-only file in your home directory?
touch file; chmod 400 file


13. How do you see command line history in UNIX?
The 'history' command can be used to get the list of commands that we have executed.

14. How to display the first 20 lines of a file?
By default, the head command displays the first 10 lines from a file. If we change the option of head, then we can display as many lines as we want.
head -20 filename
An alternative solution is using the sed command:
sed '21,$ d' filename
The d option here deletes the lines from 21 to the end of the file.

15. Write a command to print the last line of a file?
The tail command can be used to display the last lines from a file.
tail -1 filename
Alternative solutions are:
sed -n '$ p' filename
awk 'END{print $0}' filename

16. How do you rename the files in a directory with new as suffix?
ls -lrt | grep '^-' | awk '{print "mv "$9" "$9".new"}' | sh

17. Write a command to convert a string from lower case to upper case?
echo "apple" | tr [a-z] [A-Z]

18. Write a command to convert a string to Initcap.
echo apple | awk '{print toupper(substr($1,1,1)) tolower(substr($1,2))}'

19. Write a command to redirect the output of the date command to multiple files?
The tee command writes the output to multiple files and also displays the output on the terminal.
date | tee -a file1 file2 file3

20. How do you list the hidden files in the current directory?
ls -a | grep '^\.'

21. List out some of the Hot Keys available in bash shell?

Ctrl+l - Clears the Screen.

Ctrl+r - Does a search in previously given commands in shell.

Ctrl+u - Clears the typing before the hotkey.

Ctrl+a - Places the cursor at the beginning of the command at the shell.

Ctrl+e - Places the cursor at the end of the command at the shell.


Ctrl+d - Kills the shell.

Ctrl+z - Places the currently running process into the background.

22. How do you make an existing file empty?
cat /dev/null > filename


27. How do you write the contents of 3 files into a single file?
cat file1 file2 file3 > file4

28. How to display the fields in a text file in reverse order?
awk 'BEGIN {ORS=""} { for(i=NF; i>0; i--) print $i," "; print "\n"}' filename

29. Write a command to find the sum of bytes (size of file) of all files in a directory.
ls -l | grep '^-' | awk 'BEGIN {sum=0} {sum = sum + $5} END {print sum}'

30. Write a command to print the lines which end with the word "end"?
grep 'end$' filename
The '$' symbol specifies the grep command to search for the pattern at the end of the line.

31. Write a comm