Data stage.pdf


Datastage Questions with solutions.


1. What is Modulus and Splitting in Dynamic Hashed File?

In a dynamic hashed file the size changes as records are added or deleted. The modulus is the number of groups (buckets) in the file: when the file grows past its split threshold a group is added, which is called splitting; when data is removed, groups are merged back.

The minimum modulus can be set when the file is created, or adjusted later with the help of your Unix admin.

2. Types of views in DataStage Director?

There are 3 main views in DataStage Director, plus two more:

a) Job view - the jobs and the dates they were compiled and last run.

b) Status view - the status of each job's last run.

c) Log view - warning messages, event messages, program-generated messages.

d) Schedule view

e) Detail view

3. What are Stage Variables, Derivations and Constants?

Stage Variable - An intermediate processing variable that retains its value during a read and does not pass the value into a target column.

Derivation - Expression that specifies value to be passed on to the target column.

Constraint (constant) - A condition that is either true or false and controls the flow of data on an output link.

Stage variables: a temporary memory area. Derivation: where you apply the business rule. Constraint: where you apply conditions.

The order of execution is: stage variables, then constraints, then derivations.

4. What is the default cache size? How do you change the cache size if needed?

The default read and write cache size is 128 MB each. It can be changed in DataStage Administrator: select the project, open the Tunables tab and specify the cache sizes there.

5. Containers: Usage and Types?

A container is a collection of stages grouped together for reusability. There are 2 types of containers:

a) Local container: specific to the job it is created in.

b) Shared container: can be used in any job within the project.

There are two types of shared container:

1. Server shared container - used in server jobs (can also be used in parallel jobs).

2. Parallel shared container - used in parallel jobs. You can also include server shared containers in parallel jobs as a way of incorporating server job functionality into a parallel job (for example, to make a server plug-in stage available to a parallel job).

6. Types of Parallel Processing?

At the hardware level, parallel processing is broadly classified into 2 types:

a) SMP - Symmetric Multiprocessing.

b) MPP - Massively Parallel Processing.

Within DataStage itself there are two kinds of parallelism: pipeline parallelism and partition (data) parallelism. (Round robin is a partitioning method, not a type of parallelism.)

7. What does a Config File in parallel extender consist of?

The config file consists of the following:

a) The number of processing nodes.

b) The actual disk and scratch-disk storage locations.

The config file (pointed to by APT_CONFIG_FILE) is read by the DataStage engine before running a job in PX; it describes your server, i.e. the nodes and their resources.
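
As a rough illustration (the host name and paths below are hypothetical, not taken from any real project), a two-node configuration file typically looks like this:

    {
      node "node1" {
        fastname "etlserver"
        pools ""
        resource disk "/ds/data/node1" {pools ""}
        resource scratchdisk "/ds/scratch/node1" {pools ""}
      }
      node "node2" {
        fastname "etlserver"
        pools ""
        resource disk "/ds/data/node2" {pools ""}
        resource scratchdisk "/ds/scratch/node2" {pools ""}
      }
    }

Each node entry names a logical processing node, the host it runs on (fastname), and the disk and scratch areas the engine may use on that node.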

7. Functionality of Link Partitioner and Link Collector?

Link Partitioner: splits data into multiple partitions or data flows using various partitioning methods. Link Collector: collects the data coming from the partitions, merges it into a single data flow and loads it to the target.

Server jobs mainly execute in sequential fashion; the IPC stage, as well as Link Partitioner and Link Collector, simulate a parallel mode of execution for server jobs on a single-CPU machine. Link Partitioner receives data on a single input link and diverts it to a maximum of 64 output links, all carrying the same metadata. Link Collector collects the data from up to 64 input links, merges it into a single data flow and loads it to the target. Both are active stages, and the design and mode of execution of server jobs has to be decided by the designer.

8. What is SQL tuning? How do you do it ?

SQL tuning is done in the database using hints and cost-based optimization. Important init.ora/pfile parameters include sort_area_size, sort_area_retained_size, db_file_multiblock_read_count, open_cursors and cursor_sharing, plus optimizer_mode (e.g. CHOOSE or RULE).

9. How do you track performance statistics and enhance it?

Through the Monitor in DataStage Director we can view the performance statistics.

10. What is the order of execution done internally in the transformer, with the stage editor having input links on the left hand side and output links on the right?

Stage variables, constraints and column derivation or expressions.

11. What are the difficulties faced in using DataStage? Or, what are the constraints in using DataStage?

1) Performance suffers if the number of lookups in a job is large. 2) If the job aborts while loading the data for some reason, recovery and restart have to be handled.

12. Differentiate database data and data warehouse data?

Data in a database is a) detailed or transactional, b) both readable and writable, and c) current. By database, one means OLTP (On-Line Transaction Processing). This can be the source systems or the ODS (Operational Data Store), which contains the transactional data. Data warehouse data, by contrast, is historical, summarized, time-variant and non-volatile.

12. Dimension Modeling types along with their significance

Data modeling is broadly classified into 2 types: a) E-R diagrams (entity-relationship modeling) and b) dimensional modeling, which in turn has 2.a) logical modeling and 2.b) physical modeling.

13. What is the flow of loading data into fact & dimensional tables?

Fact table - a table with a collection of foreign keys corresponding to the primary keys in the dimension tables; it consists of fields with numeric measures. Dimension table - a table with a unique primary key. Load - data should first be loaded into the dimension tables; based on the primary key values in the dimension tables, the data is then loaded into the fact table.

Here is the sequence of loading a data warehouse.

1. The source data is first loaded into the staging area, where data cleansing takes place.

2. The data from staging area is then loaded into dimensions/lookups.

3. Finally, the fact tables are loaded from the corresponding source tables in the staging area.

14. What are XML files and how do you read data from XML files? What stage is to be used?

In the palette there are Real Time stages such as XML Input, XML Output and XML Transformer.

15. Why do you use SQL LOADER or OCI STAGE?

Data transfers very quickly to the data warehouse when using SQL*Loader. When the source data is enormous, or for bulk data, we can use the OCI stage or SQL*Loader depending on the source.

16. Suppose there are a million records: did you use OCI? If not, which stage do you prefer?

Use the Orabulk (bulk load) stage instead.

17. How do you populate source files?

There are many ways to populate them; writing a SQL statement in Oracle to extract the data is one way.

18. How do you pass the parameter to the job sequence if the job is running at night?

Two ways

1. Set the default values of Parameters in the Job Sequencer and map these parameters to job.

2. Run the job in the sequencer using the dsjob utility, where we can specify the values to be taken for each parameter (for example, dsjob -run -param Name=Value <project> <job>).

19. What happens if the job fails at night?

The job sequence aborts (unless restartability or exception handling has been built into the sequence).

20. Explain the differences between Oracle8i/9i?

Oracle 9i added multiprocessing features such as Real Application Clusters and better support for analytic/dimensional workloads than 8i.

21. What are Static Hash files and Dynamic Hash files?

As the names suggest. In general we use Type 30 dynamic hashed files. The data file has a default size of 2 GB, and the overflow file is used if the data exceeds 2 GB. Hashed files have a default size established by their modulus and separation when you create them, and this can be static or dynamic. Overflow space is only used when data grows over the reserved size for one of the groups (sectors) within the file. There are as many groups as specified by the modulus.

22. What is Hash file stage and what is it used for?

Used for lookups; it is like a reference table. It is also used in place of ODBC/OCI tables for better performance. We can also use the Hash File stage to avoid or remove duplicate rows by specifying the hash key on a particular field.

23. Did you Parameterize the job or hard-coded the values in the jobs?

Always parameterize the job. The values either come from Job Properties or from a 'Parameter Manager' - a third-party tool. You should never hard-code parameters in your jobs. The most often parameterized variables in a job are: DB DSN name, username, password, and the dates against which the data is to be looked up.

24. What are Sequencers?

Sequencers are job control programs that execute other jobs with preset job parameters. A Sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers. The Sequencer operates in two modes: ALL mode, in which all of the inputs to the sequencer must be TRUE for any of the sequencer outputs to fire, and ANY mode, in which output triggers can fire if any of the sequencer inputs are TRUE.

25. What other performance tunings have you done in your last project to increase the performance of slowly running jobs?

1. Staged the data coming from ODBC/OCI/DB2 UDB stages or any database on the server using hashed/sequential files, for optimum performance and also for data recovery in case the job aborts.
2. Tuned the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
3. Tuned the 'Project Tunables' in Administrator for better performance.
4. Used sorted data for the Aggregator.
5. Sorted the data as much as possible in the database and reduced the use of DS sorts for better job performance.
6. Removed unused data from the source as early as possible in the job.
7. Worked with the DB admin to create appropriate indexes on tables for better performance of DS queries.
8. Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
9. If an input file has an excessive number of rows and can be split up, use standard logic to run jobs in parallel.
10. Before writing a routine or a transform, make sure that the functionality required is not already in one of the standard routines supplied in the SDK or DS utilities categories. Constraints are generally CPU intensive and take a significant amount of time to process; this may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal.
11. Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate unnecessary records even getting in before joins are made.
12. Tuning should occur on a job-by-job basis.
13. Use the power of the DBMS.
14. Try not to use a Sort stage when you can use an ORDER BY clause in the database.
15. Using a constraint to filter a record set is much slower than performing a SELECT ... WHERE....
16. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.

Additional tuning tips:

1. Minimize the use of the Transformer (use Copy, Modify, Filter or Row Generator instead where possible).
2. Use SQL code while extracting the data.
3. Handle the nulls.
4. Minimize the warnings.
5. Reduce the number of lookups in a job design.
6. Use not more than about 20 stages in a job.
7. Use an IPC stage between two passive stages; it reduces processing time.
8. Drop indexes before loading data and recreate them after loading data into tables.
9. Generally we cannot avoid lookups if the requirements make them compulsory.
10. There is no hard limit on the number of stages (such as 20 or 30), but we can break the job into smaller jobs and use Data Set stages to store the intermediate data.
11. The IPC stage is provided in server jobs, not in parallel jobs.
12. Check the write cache of the hashed file. If the same hashed file is used both for lookup and as a target, disable this option.
13. If the hashed file is used only for lookup, enable 'Preload to memory'. This will improve performance. Also check the order of execution of the routines.
14. Don't use more than 7 lookups in the same Transformer; introduce new Transformers if it exceeds 7 lookups.
15. Use the 'Preload to memory' option on the hashed file output.
16. Use 'Write to cache' on the hashed file input.
17. Write into the error tables only after all the Transformer stages.
18. Reduce the width of the input record - remove the columns that you will not use.
19. Cache the hashed files you are reading from and writing into. Make sure your cache is big enough to hold the hashed files.
20. Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hashed files. This will also minimize overflow on the hashed file.
21. If possible, break the input into multiple threads and run multiple instances of the job.

using ODBC or OLE.

26. How did you handle reject data?

Typically a reject link is defined and the rejected data is loaded back into the data warehouse after correction. A reject link has to be defined for every output link on which you wish to collect rejected data. Rejected data is typically bad data, such as duplicate primary keys or null rows where data is expected. We can handle rejected data by collecting it separately in a sequential file.

27. What are Routines and where/how are they written and have you written any routines before?

Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them. The following are the different types of routines:

1) Transform functions 2) Before-after job subroutines 3) Job Control routines

Routines are stored in the Routines branch of the DataStage Repository, where you can create, view, or edit them using the Routine dialog box. The following program components are classified as routines:

• Transform functions. These are functions that you can use when defining custom transforms. DataStage has a number of built-in transform functions, located in the Routines > Examples > Functions branch of the Repository. You can also define your own transform functions in the Routine dialog box.

• Before/After subroutines. When designing a job, you can specify a subroutine to run before or after the job, or before or after an active stage. DataStage has a number of built-in before/after subroutines, located in the Routines > Built-in > Before/After branch in the Repository. You can also define your own before/after subroutines using the Routine dialog box.

• Custom UniVerse functions. These are specialized BASIC functions that have been defined outside DataStage. Using the Routine dialog box, you can get DataStage to create a wrapper that enables you to call these functions from within DataStage. These functions are stored under the Routines branch in the Repository; you specify the category when you create the routine. If NLS is enabled, you should be aware of any mapping requirements when using custom UniVerse functions: if a function uses data in a particular character set, it is your responsibility to map the data to and from Unicode.

• ActiveX (OLE) functions. You can use ActiveX (OLE) functions as programming components within DataStage. Such functions are made accessible to DataStage by importing them, which creates a wrapper that enables you to call the functions. After import, you can view and edit the BASIC wrapper using the Routine dialog box. By default, such functions are located in the Routines > Class name branch in the Repository, but you can specify your own category when importing the functions.

When using the Expression Editor, all of these components appear under the DS Routines... command on the Suggest Operand menu.

A special case of routine is the job control routine. Such a routine is used to set up a DataStage job that controls other DataStage jobs. Job control routines are specified on the Job control page of the Job Properties dialog box; they are not stored under the Routines branch in the Repository.

Transforms are stored in the Transforms branch of the DataStage Repository, where you can create, view or edit them using the Transform dialog box. Transforms specify the type of data transformed, the type it is transformed into, and the expression that performs the transformation. DataStage is supplied with a number of built-in transforms (which you cannot edit). You can also define your own custom transforms, which are stored in the Repository and can be used by other DataStage jobs. When using the Expression Editor, the transforms appear under the DS Transform... command on the Suggest Operand menu.

Functions take arguments and return a value. The word "function" is applied to many components in DataStage:

• BASIC functions. These are one of the fundamental building blocks of the BASIC language. When using the Expression Editor, you can access the BASIC functions via the Function... command on the Suggest Operand menu.

• DataStage BASIC functions. These are special BASIC functions that are specific to DataStage and are mostly used in job control routines. DataStage functions begin with DS to distinguish them from general BASIC functions. When using the Expression Editor, you can access the DataStage BASIC functions via the DS Functions... command on the Suggest Operand menu.

The following items, although called "functions," are classified as routines and are described above; when using the Expression Editor, they all appear under the DS Routines... command on the Suggest Operand menu: transform functions, custom UniVerse functions and ActiveX (OLE) functions.

Expressions: an expression is an element of code that defines a value. The word "expression" is used both as a specific part of BASIC syntax and to describe portions of code that you can enter when defining a job. Areas of DataStage where you can use such expressions are: defining breakpoints in the debugger; defining column derivations, key expressions and constraints in Transformer stages; and defining a custom transform. In each of these cases the DataStage Expression Editor guides you as to what programming elements you can insert into the expression.
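
As a hedged illustration (the routine name and argument are hypothetical), the body of a simple server transform function written in DataStage BASIC could look like the sketch below; the Routine dialog box supplies the argument list and the result is returned in the variable Ans:

    * TrimUpper: return the argument trimmed of surrounding spaces and converted to upper case.
    * Argument Arg1 is defined on the Arguments grid of the Routine dialog box.
    Ans = Upcase(Trim(Arg1))

Such a function would then appear under DS Routines... in the Expression Editor and could be called from any column derivation.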

28. What are OConv () and Iconv () functions and where are they used?

Iconv() - converts a string to an internal storage format. Oconv() - converts an expression to an output format.

Iconv is used to convert the date into the internal format, i.e. a form that only DataStage understands.

Example: - date coming in mm/dd/yyyy format

DataStage will convert this date into some number like 740.

You can then use this internal value (740) in a derivation and output it in any format you want using Oconv. Suppose you want to change mm/dd/yyyy to dd/mm/yyyy: you use Iconv and then Oconv, e.g. Oconv(Iconv(DateStringFromInput, "D/MDY[2,2,4]"), "D/DMY[2,2,4]") - the first conversion code parses the incoming format and the second defines the output format (see the D conversion codes in the help).

29. Do u know about METASTAGE?

In simple terms, metadata is data about data, and it can describe anything in DS (a data set, a sequential file, etc.). MetaStage is used to handle metadata, which is very useful for data lineage and data analysis later on. Metadata defines the type of data we are handling; these data definitions are stored in a repository and can be accessed with MetaStage. MetaStage is a metadata repository in which you can store metadata (DDLs etc.) and perform analysis on dependencies, change impact and so on; it also provides functions and reports over that metadata.

30. Do you know about INTEGRITY/QUALITY stage?

Integrity/QualityStage is a data integration tool from Ascential that is used to standardize and integrate data from different sources.

31. What are the command line functions that import and export the DS jobs?

A. dsimport.exe - imports the DataStage components. B. dsexport.exe - exports the DataStage components.

Parameters: username, password, hostname, project name, current directory (e.g. C:/Ascential/DataStage7.5.1/dsexport.exe), file name (job name).

32. What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?

Use the crontab utility along with a shell script that calls the dsjob command with the proper parameters. Control-M scheduling tool: through Control-M you can automate the job by invoking the shell script written to schedule the DataStage jobs.

33. What will you do in a situation where somebody wants to send you a file and use that file as an input or reference, and then run the job?

A. Under Windows: use the 'Wait For File' activity in a sequencer and then run the job. You could also schedule the sequencer around the time the file is expected to arrive. B. Under UNIX: poll for the file; once the file has arrived, start the job or sequencer.

34. How can we improve the performance of DataStage jobs?

Performance and tuning of DS jobs:

1. Establish baselines. 2. Avoid using only one flow for tuning/performance testing. 3. Work in increments. 4. Evaluate data skew. 5. Isolate and solve. 6. Distribute file systems to eliminate bottlenecks. 7. Do not involve the RDBMS in initial testing. 8. Understand and evaluate the tuning knobs available.

35. What are the job parameters?

These parameters are used to provide administrative access and to change run-time values of the job. Edit > Job Properties > Parameters: on that Parameters tab we can define the name, prompt, type and value.

36. What is the difference between a routine, a transform and a function?

Routines and transforms sound similar, but a routine contains the business logic (BASIC code that can be called from many places), whereas a transform specifies how data is converted from one form to another by applying transformation rules (an expression).

38. What are all the third party tools used in DataStage?

Autosys, TNG and Event Coordinator are some of the third-party tools I know of and have worked with.

39. How can we implement Lookup in DataStage Server jobs?

We can use a hashed file as a lookup in server jobs; the hashed file needs at least one key column. Hashed files store data based on a hashing algorithm and key values. The DB2 stage can also be used for lookups. In Enterprise Edition, the Lookup stage is used for lookups. On the server canvas we can perform two kinds of direct lookups: one using a hashed file and the other using a Database/ODBC stage as the lookup.

40. How can we join one Oracle source and Sequential file?.

A Join or Lookup stage can be used to join an Oracle source and a sequential file (in parallel jobs); in server jobs, load one of them into a hashed file and use it as a lookup.

41. What is iconv and oconv functions?

Iconv() converts a string to the internal storage format; Oconv() converts an expression to an output format.

42. Difference between Hashfile and Sequential File?

A hashed file stores data based on a hashing algorithm and a key value; a sequential file is just a flat file with no key column. A hashed file can be used as a reference for lookups; a sequential file cannot.

43. What is DS Administrator used for - did u use it?

The Administrator enables you to set up DataStage users, control the purging of the Repository and, if National Language Support (NLS) is enabled, install and manage maps and locales.

44. How do you eliminate duplicate rows?

Use the Remove Duplicates stage: it takes a single sorted data set as input, removes all duplicate records, and writes the results to an output data set. Alternatively, use a unique/distinct function in the source query.

45. Dimensional modelling is again sub divided into 2 types.

a) Star schema - simple and much faster; denormalized form. b) Snowflake schema - more complex, with more granularity; more normalized form.

46. How will you call an external function or subroutine from DataStage?

DataStage provides options to call external programs, e.g. the ExecSH before/after subroutine (or the DSExecute subroutine from within a routine).
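
A hedged sketch in DataStage BASIC (the script path is hypothetical) showing DSExecute called from a routine to run an external command and check its return status:

    * Run a UNIX command or script and capture its output and exit status.
    Command = "/apps/etl/scripts/cleanup.sh 2>&1"
    Call DSExecute("UNIX", Command, Output, SysRet)
    If SysRet <> 0 Then
       Call DSLogWarn("cleanup.sh returned status ":SysRet, "RunCleanup")
    End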

46. How do you pass a filename as a parameter to a job?

During job development we can create a parameter 'FILE_NAME', and the value can be passed in while running the job.

1. Go to DataStage Administrator > Projects > Properties > Environment > User Defined. Here you can see a grid where you can enter your parameter name and the corresponding path of the file.

2. Go to the Stage tab of the job, select the NLS tab, click on "Use Job Parameter" and select the parameter name which you defined above. The selected parameter name appears in the text box beside the "Use Job Parameter" button. Copy the parameter name from the text box and use it in your job. Keep the project default in the text box.

47. How to handle Date convertions in Datastage? Convert a mm/dd/yyyy format to yyyy-dd-mm?

a) "Iconv" function - Internal Convertion. b) "Oconv" function - External Convertion.

One suggested function to convert mm/dd/yyyy format to yyyy-dd-mm is Oconv(Iconv(FieldName, "D/MDY[2,2,4]"), "D-MDY[2,2,4]").

Here is the right conversion: the function to convert mm/dd/yyyy format to yyyy-dd-mm is Oconv(Iconv(FieldName, "D/MDY[2,2,4]"), "D-YDM[4,2,2]").

ToChar(%date%, %format%) should also work; in the format argument specify which format you want, i.e. 'yyyy-dd-mm'.
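
As a worked illustration (the input value is hypothetical), converting the string "12/31/2005" with the corrected expression above:

    Oconv(Iconv("12/31/2005", "D/MDY[2,2,4]"), "D-YDM[4,2,2]")

Iconv first turns the external mm/dd/yyyy string into the internal day number, and Oconv then formats that number as year-day-month with "-" as the separator, giving "2005-31-12".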

47. What is the difference between an operational data store (ODS) and a data warehouse?

Data that is volatile is ODS data, while data that is non-volatile, historical and time-variant is data warehouse data. In simple terms, the ODS holds dynamic data.

A data warehouse is a decision-support database for organisational needs. It is a subject-oriented, non-volatile, integrated, time-variant collection of data.

An ODS (Operational Data Store) is an integrated collection of related information; it typically contains at most around 90 days of information. The ODS is part of the transactional environment: it keeps integrated data from different transactional databases and allows common operations across the organisation, e.g. banking transactions.

An operational data store (or "ODS") is a database designed to integrate data from multiple sources to facilitateoperations, analysis and reporting. Because the data originates from multiple sources, the integration often involvescleaning, redundancy resolution and business rule enforcement. An ODS is usually designed to contain low level oratomic (indivisible) data such as transactions and prices as opposed to aggregated or summarized data such as netcontributions. Aggregated data is usually stored in the Data warehouse

48. How can we create Containers?

There are Two types of containers 1.Local Container 2.Shared Container

A local container is available only to the job it was created in, whereas shared containers can be used anywhere in the project.

Local container: Step 1: select the stages required. Step 2: Edit > Construct Container > Local.

Shared container: Step 1: select the stages required. Step 2: Edit > Construct Container > Shared.

Shared containers are stored in the Shared Containers branch of the repository tree. Containers are a special type of job component in DataStage that simplify the job design, in either server or parallel jobs.

2 types of containers 1.Local containers 2 shared containers

Local containers r devoloped & stored within the part of the job. shared containers can be devoloped & stored withinthe repository

Shared containers are of 2 types: 1. server shared containers, 2. parallel shared containers.

49. Importance of Surrogate Key in Data warehousing?

A surrogate key is the primary key for a dimension table. Its main importance is that it is independent of the underlying database, i.e. the surrogate key is not affected by changes going on in the source database.

The concept of a surrogate key comes into play when there are slowly changing dimensions in a table. In such conditions there is a need for a key by which we can identify the changes made to the dimensions. These slowly changing dimensions can be of three types, namely SCD1, SCD2 and SCD3. Surrogate keys are system-generated keys, usually just a sequence of numbers, though they can also be alphanumeric; they are used to keep track of changes to the natural key. A surrogate key should be a system-generated number, ideally a small integer. For each dimension table, depending on the SCD type and the total number of records expected over, say, 4 years, you may limit the maximum number; this improves indexing, performance and query processing. The surrogate key is a numeric key that is the primary key in the dimension table and a foreign key in the fact table, and it is used to handle missing data and complex situations in DataStage.
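
As a hedged illustration (the sequence name and job parameter are hypothetical), a surrogate key can be generated in a server Transformer derivation either with the SDK key-management routine or from the output row counter:

    * Using the SDK key-management routine with a named sequence:
    KeyMgtGetNextValue("CUST_DIM_KEY")

    * Or, for a single-instance job, offsetting the output row counter by the
    * current maximum key passed in as a job parameter:
    MaxKeyParam + @OUTROWNUM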

50. How do you merge two files in DS?

Either use the copy command as a before-job subroutine if the metadata of the 2 files is the same, or create a job to concatenate the 2 files into one if the metadata is different.


51. How do we do the automation of dsjobs?

"dsjobs" can be automated by using Shell scripts in UNIX system. We can call Datastage Batch Job from Commandprompt using 'dsjob'. We can also pass all the parameters from command prompt. Then call this shell script in any of themarket available schedulers. The 2nd option is schedule these jobs using Data Stage director.

52. What are types of Hashed File?

Hashed files are broadly classified into 2 types:

a) Static - subdivided into 17 types based on the primary key pattern.

b) Dynamic - subdivided into 2 types: i) Generic ii) Specific.

The default hashed file is dynamic, Type 30.

53. How do you eliminate duplicate rows?

In the database: DELETE FROM emp WHERE rowid NOT IN (SELECT MAX(rowid) FROM emp GROUP BY column_name) - or MIN(rowid). DataStage Enterprise Edition provides a Remove Duplicates stage; using that stage we can eliminate duplicates based on a key column. Duplicates can also be eliminated by loading the corresponding data into a hashed file, specifying the columns on which you want to eliminate duplicates as the keys of the hash. So removal of duplicates is done in two ways: 1. use the Remove Duplicates stage, or 2. use GROUP BY on all the columns used in the SELECT, and the duplicates will go away.

54. What about System variables?

DataStage provides a set of variables containing useful system information that you can access from a transform or routine. System variables are read-only.

@DATE The internal date when the program started. See the Date function.

@DAY The day of the month extracted from the value in @DATE.

@FALSE The compiler replaces the value with 0.

@FM A field mark, Char(254).

@IM An item mark, Char(255).

@INROWNUM Input row counter. For use in constraints and derivations in Transformer stages.

@OUTROWNUM Output row counter (per link). For use in derivations in Transformer stages.

@LOGNAME The user login name.

@MONTH The current month extracted from the value in @DATE.

@NULL The null value.

@NULL.STR The internal representation of the null value, Char(128).

@PATH The pathname of the current DataStage project.

@SCHEMA The schema name of the current DataStage project.

@SM A subvalue mark (a delimiter used in UniVerse files), Char(252).

@SYSTEM.RETURN.CODE Status codes returned by system processes or commands.

@TIME The internal time when the program started. See the Time function.

@TM A text mark (a delimiter used in UniVerse files), Char(251).

@TRUE The compiler replaces the value with 1.

@USERNO The user number.

@VM A value mark (a delimiter used in UniVerse files), Char(253).

@WHO The name of the current DataStage project directory.

@YEAR The current year extracted from @DATE.

REJECTED Can be used in the constraint expression of a Transformer stage output link. REJECTED is initially TRUE, but is set to FALSE whenever an output link is successfully written.
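
A hedged example of how some of these variables appear in a server Transformer (the link and column names are hypothetical):

    * Constraint on an output link: pass only the first 1000 input rows that have a non-null key.
    Not(IsNull(DSLink3.CUST_ID)) And @INROWNUM <= 1000

    * Constraint on a reject link: catch rows that were written to no other output link.
    REJECTED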

55. Where does a UNIX script used by DataStage execute - on the client machine or on the server?

DataStage jobs are executed on the server machine only; nothing is stored or executed on the client machine.

56. What is the default number of nodes for DataStage parallel (Enterprise) Edition?

The default is one node. The actual number of nodes depends on the configuration file and the number of processors in your system; if your system has two processors you typically define two nodes.

57. What happens if RCP is disabled?

In such a case OSH has to perform an import and export every time the job runs, and the processing time of the job increases. Runtime column propagation (RCP): if RCP is enabled for a job, and specifically for those stages whose output connects to the shared container input, then metadata will be propagated at run time, so there is no need to map it at design time. If RCP is disabled for the job, OSH has to perform the import and export on every run, and the job's processing time increases.

58. I want to process 3 files sequentially, one by one. How can I do that, so that the files are fetched automatically while processing?

If the metadata for all the files is the same, then create a job having the file name as a parameter, then use the same job in a routine and call the job with a different file name each time. Alternatively, you can create a sequencer to drive the job.

59. What is the difference between DataStage and Informatica?

Here is a good way to think about the differences, which helps to get an idea: basically it depends on what you are trying to accomplish. What are the requirements for your ETL tool? Do you have large sequential files (1 million rows, for example) that need to be compared every day against yesterday's? If so, ask how each vendor would do that, and think about what process they would use. Are they requiring you to load yesterday's file into a table and do lookups? If so, run! Are they doing a match/merge routine that knows how to process this in sequential files? Then maybe they are the right one. It all depends on what you need the ETL to do. If your data sets are small enough, either would probably be OK.

60. What is OCI, and how is it used in the ETL tool?

Some describe OCI as Orabulk-style bulk loading from the client, but OCI does not mean Orabulk data: it stands for Oracle Call Interface, the native, low-level Oracle API that DataStage uses to load and extract data. The OCI stage acts like a native tool for an Oracle database - you just drag the Oracle OCI stage from the palette and use it to load data.

61. How can I connect my DB2 database on AS400 to DataStage? Do I need to use ODBC first to open the database connectivity and then use an adapter for just connecting between the two?

You need to configure ODBC connectivity for the database (DB2 or AS400) in DataStage. There is also an option to use the DB2 stage directly - you can just drag it onto the canvas and use it to load - but it is often easier to use ODBC.

61. How can I extract data from DB2 (on IBM iSeries) to the data warehouse via DataStage as the ETL tool? I mean, do I first need to use ODBC to create connectivity and use an adapter for the extraction and transformation of data?

From the DB2 stage we can extract the data in ETL. You would need to install ODBC drivers to connect to the DB2 instance (these do not come with the regular drivers we usually install; use the CD provided with the DB2 installation, which has the ODBC drivers for DB2) and then try it out. If your source system is a mainframe, you can use the load/unload utilities: unload will extract the records from the mainframe system, and from there you have to export them to your system (Windows).

62. What is merge and how can it be done? Please explain with a simple example taking 2 tables.

Merge is used to join two tables. It takes the key columns and sorts them in ascending or descending order. Let us consider two tables, Emp and Dept. If we want to join these two tables, we have DeptNo as a common key, so we can give that column name as the key, sort DeptNo in ascending order and join the two tables. The Merge stage is used only for flat files in the server edition.

63. What happens when the output of a hash file is connected to a transformer? What error does it throw?

If you connect the output of a hashed file to a Transformer, it acts as a reference (lookup); there are no errors at all. It can be used in implementing SCDs. If the hashed file output is connected to a Transformer stage, the hashed file is treated as the lookup file when there is a primary link to the same Transformer stage; if there is no primary link, it is treated as the primary link itself. You can do SCD in a server job by using this lookup functionality. This will not return any error code.

64. What are the Repository Tables in DataStage and What are they?

A data warehouse is a repository (centralized as well as distributed) of data, able to answer any ad-hoc, analytical, historical or complex queries. Metadata is data about data; examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries and navigation services. Metadata includes things like the name, length, valid values and description of a data element. Metadata is stored in a data dictionary and repository; it insulates the data warehouse from changes in the schema of operational systems. In DataStage, under I/O and Transfer, on the interface tab (input, output and transfer pages) you will have 4 tabs, and the last one is Build, under which you can find the table name. The DataStage client components are: Administrator - administers DataStage projects and conducts housekeeping on the server; Designer - creates DataStage jobs that are compiled into executable programs; Director - used to run and monitor the DataStage jobs; Manager - allows you to view and edit the contents of the repository.

65. How can we pass parameters to a job by using a file?

You can do this by passing parameters from a UNIX file and then calling the execution of the DataStage job; the DS job has the parameters defined, and they are passed in by the UNIX script. You can create a UNIX shell script that passes the parameters to the job, and you can also create logs for the whole run process of the job. A DataStage BASIC job control routine can do the same, as sketched below.
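
A hedged sketch in DataStage BASIC job control code (the file path, job name and name=value layout of the parameter file are assumptions), reading parameters from a file and applying them with DSSetParam before running the job:

    * Read name=value pairs from a parameter file and set them on the job before running it.
    hJob = DSAttachJob("LoadSales", DSJ.ERRFATAL)
    OpenSeq "/apps/etl/params/loadsales.prm" To PF Then
       Loop
          ReadSeq Line From PF Else Exit
          ErrCode = DSSetParam(hJob, Field(Line, "=", 1), Field(Line, "=", 2))
       Repeat
       CloseSeq PF
    End Else
       Call DSLogWarn("Cannot open parameter file", "JobControl")
    End
    ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)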

66. What is the meaning of the following?

1) If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel. 2) Tuning should occur on a job-by-job basis; use the power of the DBMS. The question is not very clear, but I will try to answer.

If you have SMP machines you can use the IPC, Link Collector and Link Partitioner stages for performance tuning. If you have cluster/MPP machines you can use parallel jobs. The third point is about tuning the performance of the job: "use the power of the DBMS" means one can improve the performance of the job by using the power of the database - analyzing tables, creating indexes, creating partitions - to improve the performance of the SQL used in the jobs.

67. What is the meaning of "Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records even getting in before joins are made"?

It probably means that you can put the selection criteria in the WHERE clause, i.e. whatever data you need to filter, filter it out in the SQL rather than carrying it forward and then filtering it out with a constraint. A constraint is nothing but a restriction on data; here it is a restriction at entry itself, which avoids bringing in unnecessary data. In other words, improve performance by avoiding constraints wherever possible and instead selecting only the needed data using a WHERE clause.

68. How can we ETL an Excel file into a data mart?

Take the source file (Excel file) in .csv format and apply the conditions that satisfy the data mart. Or, create a DSN in the Control Panel using the Microsoft Excel driver; you can then read the Excel file from an ODBC stage. Open the ODBC Data Source Administrator found in Control Panel > Administrative Tools; under the System DSN tab, add the Microsoft Excel driver. Then you will be able to access the XLS file from DataStage.

69. what is difference between server jobs & paraller jobs

Server jobs. These are available if you have installed DataStage Server. They run on the DataStage server, connecting to other data sources as necessary.

Parallel jobs. These are only available if you have installed Enterprise Edition. They run on DataStage servers that are SMP, MPP or cluster systems. They can also run on a separate z/OS (USS) machine if required.

Parallel jobs are also available if you have DataStage 6.0 PX or DataStage 7.0 installed. Parallel jobs are especially useful if you have large amounts of data to process.

Server jobs: These are compiled and run on DataStage Server

Parallel jobs: These are available only if you have Enterprise Edition installed. They are compiled and run on a DataStage Unix server, and can be run in parallel on SMP, MPP and cluster systems.

Server jobs can be run on SMP and MPP machines; here performance is lower, i.e. speed is less.

Parallel jobs can be run on cluster machines; here performance is higher, i.e. speed is high.

70. What is merge, and how do we use merge?

Merge is not just a filter condition. Merge is a stage that is available in both parallel and server jobs. The Merge stage is used to join two tables (server/parallel) or two tables/data sets (parallel). Merge requires that the master table/data set and the update table/data set be sorted. The merge is performed on a key field, and the key field is mandatory in both the master and update data set/table. In a parallel job the Merge stage is mainly used to merge two or more data sets: it takes one master data set and n update data sets, and the output is one final data set plus as many reject data sets as there are update files; it is mainly used for joins. In server jobs, the Merge stage is used to merge (join) two flat or sequential files.

The Merge stage is a processing stage; it can have any number of input links and only one output link (plus reject links). It has a master data set and one or more update data sets. The output of the Merge stage is the master data set plus additional columns from each update link. In DataStage Server, Merge lets you merge two flat files by specifying their location and name; the output is the join of the two files. It is similar to the Join and Lookup stages, the main difference being how much memory they use.

71. How do we use NLS functionality in DataStage? What are the advantages of NLS? Where can we use it? Explain briefly.

As per the manuals and documents, we have different levels of interfaces - can you be more specific? For example Teradata interface operators, DB2 interface operators, Oracle interface operators and SAS interface operators. Orchestrate National Language Support (NLS) makes it possible for you to process data in international languages using Unicode character sets. The International Components for Unicode (ICU) libraries support NLS functionality in Orchestrate. Operators with NLS functionality include: the Teradata interface operators, the switch operator, the filter operator, the DB2 interface operators, the Oracle interface operators, the SAS interface operators, the transform operator, the modify operator, the import and export operators, and the generator operator. Should you need any further assistance please let me know; I shall share as much as I can.

By using NLS we can do the following: process data in a wide range of languages; use local formats for dates, times and money; and sort data according to local rules.

If NLS is installed, various extra features appear in the product. For server jobs, NLS is implemented in the DataStage Server engine; for parallel jobs, NLS is implemented using the ICU library.

72. What is APT_CONFIG in DataStage?

Anyway, APT_CONFIG_FILE (not just APT_CONFIG) is the environment variable that points to the configuration file defining the nodes (the scratch and temp areas) for the specific project. DataStage understands the architecture of the system through this file; for example, it contains the node names, disk storage information, etc. APT_CONFIG is just an environment variable used to identify the *.apt file; don't confuse the variable with the *.apt file itself, which holds the node information and the configuration of the SMP/MPP server.

73. What is NLS in DataStage? How do we use NLS in DataStage, and what are its advantages? At the time of installation I did not choose the NLS option; now I want to use it - what can I do? Reinstall DataStage, or first uninstall and then install it again?

NLS is basically the local language (character set) setting. Once you install DataStage with NLS present, just log in to Administrator and you can set the NLS of your project based on the project requirement; you just need to map the NLS to your project. Suppose you know you have a file with some Greek characters: if you set the NLS to Greek, then while running the job DataStage will recognise those special characters.

74. What is the difference between Datastage and Datastage TX?

It is a tricky question to answer, but one thing I can tell you is that DataStage TX is not an ETL tool and it is not a new version of DataStage 7.5. TX is used for complex transaction/message transformation, for example with ODS-type sources - this much I know.

75. If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could arise?

The aggregation can produce wrong or split results, because rows with the same value of key 2 may sit in different partitions; the data has to be re-partitioned (and usually sorted) on key 2 before the aggregation, which also costs extra execution time.

76. If you're running 4-way parallel and you have 10 stages on the canvas, how many processes does DataStage create?

The answer is 40: you have 10 stages and each stage is partitioned and run on 4 nodes, which makes the total number of processes generated 40.

77. How can you do an incremental load in DataStage?

You can create a table where you store the last successful refresh time for each table/dimension. Then, in the source query, take the delta between the last successful refresh time and SYSDATE; that gives you the incremental load. Incremental load means the daily load: whenever you select data from the source, select the records that were inserted or updated between the timestamp of the last successful load and the start date and time of today's load. For this you have to pass parameters for those two dates: store the last run date and time in a file, read it into a job parameter, and set the second argument to the current date and time.

78. Does Enterprise Edition only add parallel processing for better performance? Are any stages/transformations available in the Enterprise Edition only?

DataStage Standard Edition was previously called DataStage and DataStage Server Edition. DataStage Enterprise Edition was originally called Orchestrate, then renamed to Parallel Extender when purchased by Ascential. DataStage Enterprise includes server jobs, sequence jobs and parallel jobs. The Enterprise Edition offers parallel processing features for scalable high-volume solutions; designed originally for Unix, it now supports Windows, Linux and Unix System Services on mainframes. DataStage Enterprise MVS adds MVS jobs: jobs designed using an alternative set of stages that are generated into COBOL/JCL code and are transferred to a mainframe to be compiled and run. Jobs are developed on a Unix or Windows server and transferred to the mainframe to be compiled and run. The first two versions share the same Designer interface but have a different set of design stages depending on the type of job you are working on. Parallel jobs have parallel stages but also accept some server stages via a container; server jobs only accept server stages, and MVS jobs only accept MVS stages. There are some stages that are common to all types (such as aggregation), but they tend to have different fields and options within that stage.

Row Merger and Row Splitter are only present in the parallel stage palette.

81. How can you implement complex jobs in DataStage?

A complex design means having more joins and more lookups; such a job design is called a complex job. We can easily implement any complex design in DataStage by following simple tips, which also increase performance. There is no limit on the number of stages in a job, but for better performance use at most about 20 stages in each job; if it exceeds 20 stages, go for another job. Use no more than 7 lookups per Transformer; otherwise include one more Transformer.

80. How can you implement slowly changing dimensions in DataStage? Explain. 2) Can you join a flat file and a database in DataStage? How?

Yes, we can join a flat file and a database in an indirect way. First create a job which populates the data from the database into a sequential file, say Seq_First. Take the flat file you have and use a Merge stage to join these two files; you have various join types in the Merge stage, like pure inner join, left outer join, right outer join etc., and you can use whichever suits your requirements. SCDs are of three types: Type 1 - overwrite the change; Type 2 - version the modified change; Type 3 - keep a historical version of the modified change by adding a new column to hold the changed data.

Yes, you can implement SCDs in DataStage. For SCD Type 1, just use 'insert rows else update rows' or 'update rows else insert rows' as the update action of the target.

For SCD Type 2: use a hashed file to look up the target, take 3 instances of the target, give different conditions depending on the process, give different update actions in the target, and use system variables like sysdate and null. We can handle SCDs in the following ways - Type 1: just overwrite; Type 2: we need versioning and dates; Type 3: add old and new copies of certain important fields. Hybrid dimensions are a combination of Type 2 and Type 3. Yes, you can implement Type 1, Type 2 or Type 3; let me try to explain Type 2 with a timestamp.

Step 1: the timestamp is created via a shared container; it returns the system time and a key. To satisfy the lookup condition we create a key column using the Column Generator.

Step 2: our source is a data set and the lookup table is an Oracle OCI stage. Using the Change Capture stage we find the differences; the Change Capture stage returns a value in change_code, and based on the return value we find out whether the row is an insert, edit or update. If it is an insert, we load it with the current timestamp, and the old row with its timestamp is kept as history.

81. How do you implement routines in DataStage? If anyone has any material for DataStage, please send it to me.

For parallel routines, write the routine in C or C++, create the object file and place the object in a lib directory; then open Designer, go to Routines, and configure the path and routine name there. There are 3 kinds of routines in DataStage:

1. Server routines, which are used in server jobs; these routines are written in the BASIC language.

2. Parallel routines, which are used in parallel jobs; these routines are written in C/C++.

3. Mainframe routines, which are used in mainframe jobs.

82. What is the difference between DataStage and Informatica?

The main difference between DataStage and Informatica that is often cited is scalability - some say Informatica is more scalable than DataStage. In my view DataStage is also scalable; the difference lies in the number of built-in functions, which makes DataStage more user-friendly. In another view, DataStage has fewer transformers compared to Informatica, which can make some work harder. The main difference is the vendors - each one has strengths from its architecture; for DataStage it is a top-down approach, and based on the business needs we have to choose products. The main difference lies in parallelism: DataStage uses the parallelism concept through node configuration, whereas Informatica does not. Having used both DataStage and Informatica: in my opinion, DataStage is way more powerful and scalable than Informatica. Informatica has more developer-friendly features, but when it comes to scalability in performance it is much inferior compared to DataStage.

Here are a few areas where Informatica is inferior -

1. Partitioning - DataStage PX provides many more robust partitioning options than Informatica. You can also re-partition the data whichever way you want.

2. Parallelism - Informatica does not support full pipeline parallelism (although it claims to).

3. File Lookup - Informatica supports flat file lookup, but the caching is horrible. DataStage supports hash files, lookup filesets and datasets for much more efficient lookups.

4. Merge/Funnel - DataStage has very rich functionality for merging or funnelling streams. In Informatica the only way is to do a Union, which by the way is always a Union-all.

83. DataStage from Staging to MDW is only running at 1 row per second! What do we do to remedy?

I am assuming that there are too many stages, which is causing the problem, and providing the solution on that basis.

In general, if you have too many stages (especially Transformers and hash lookups), there is a lot of overhead and performance degrades drastically. I would suggest you write a query instead of doing several lookups. It may seem embarrassing to have a tool and still write a query, but that is sometimes best. If many lookups are being done, ensure that you have appropriate indexes for the queries. If you do not want to write the query and prefer to use intermediate stages, ensure that you eliminate unneeded data between stages so that data volumes do not cause overhead. There might also be a re-ordering of stages needed for good performance.

Other things in general that could be looked into:

1) For massive transactions, set the hashing size and buffer size to appropriate values so that as much as possible is done in memory and there is no I/O overhead to disk.

2) Enable row buffering and set an appropriate size for the row buffer.


3) It is important to use appropriate objects between stages for performance

84. What is the User Variables activity? When and how is it used? Where is it used? Give a real example.

Using the User Variables activity we can create variables in the job sequence; these variables are available to all the activities in that sequence. Typically this activity is placed at the start of the job sequence.

85. What is the difference between build-ops and subroutines?

A build-op generates C++ code (object-oriented), whereas a subroutine is a normal program that you can call anywhere in your project.

86. There are three different types of user-created stages available for PX. What are they? Which would you use? What are the disadvantages of using each type?

These are the three different stages: i) Custom ii) Build iii) Wrapped

87. What is the exact difference between Join, Merge and Lookup Stage??

The exact difference between Join, Merge and Lookup is that the three stages differ mainly in the memory they use. DataStage doesn't know how large your data is, so it cannot make an informed choice whether to combine data using a Join stage or a Lookup stage. Here's how to decide which to use:

If the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. Once the sort is over, the join processing is very fast and never involves paging or other I/O. Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links, as many as there are update input links. The concept of merge and join is different in the parallel edition; in server jobs there is no Join component, and Merge serves that purpose there. To my knowledge, Join and Merge are both used to join two files of the same structure, whereas Lookup is mainly used to compare previous data with current data. We can join two relational tables using a hash file only in server jobs; the server Merge stage is only for flat files. A Join takes a maximum of two input datasets to a single output, but a Merge can have more than two input datasets to a single output. Also remember that to use the Merge stage, the key field names MUST be equal in both input files (master and updates).

88. Can anyone tell me how to extract data from more than one heterogeneous source? For example, 1 sequential file, Sybase and Oracle in a single job.

Yes, you can extract data from heterogeneous sources in DataStage using the Transformer stage; you just need to form links from the sources into the Transformer stage. Alternatively, you can convert all heterogeneous sources into sequential files and join them using Merge, or you can write a user-defined query in the source itself to join them.

89. Can we use shared container as lookup in DataStage server jobs?

We can use a shared container as a lookup in server jobs. Whenever we need the same lookup in multiple places, we develop the lookup in a shared container and then use that shared container wherever the lookup is required.

90. How can I specify a filter command for processing data while defining sequential file output data?

We have something called after-job and before-job subroutines, with which we can execute UNIX commands. Here we can use the sort command or a filter command.

91. What validations do you perform after creating jobs in Designer? What are the different types of errors you faced during loading and how did you solve them?

Check the parameters, check whether the input files exist, check whether the input tables exist, and also check usernames, data source names, passwords and the like.

92. If I add a new environment variable in Windows, how can I access it in DataStage?


You can access it in the Designer window: under Job Properties you can add a new environment variable or use an existing one. You can view all the environment variables in Designer, check them in Job Properties, and add and access environment variables from Job Properties.

93. What are the enhancements made in DataStage 7.5 compared with 7.0?

Many new stages were introduced compared to DataStage version 7.0. In server jobs we have the Stored Procedure stage, the Command stage, and a generate-report option in the File tab. In job sequences many activities such as Start Loop, End Loop, Terminate Loop and User Variables were introduced. In parallel jobs the Surrogate Key stage and Stored Procedure stage were introduced. To my knowledge the main enhancement is that we can generate reports in 7.5, which we can't in 7.0, and we can also import more plug-in stages in 7.5. The Complex Flat File and Surrogate Key Generator stages were added in version 7.5.

94. what is data set? and what is file set?

Dataset: DataStage parallel extender jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store the data being operated on in a persistent form, which can then be used by other DataStage jobs. File set: DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are together called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file, and you may need to distribute files among nodes to prevent overruns. (A lookup file set is a related but separate concept used only with Lookup stages.)

95. How does the hash file do lookups in server jobs? How does it compare the key values?

A hashed file is used for two purposes: 1. to remove duplicate records, and 2. for reference lookups. A hashed file record has a hashed key, a key header and a data portion. By applying the hashing algorithm to the key value, the lookup is fast.

96. what are the differences between the data stage 7.0 and 7.5 in server jobs?

There are a lot of differences: many new stages are available in DS 7.5, e.g. the CDC stage, the Stored Procedure stage, etc.

97. Is it possible to run parallel jobs in server jobs?

No. We need a UNIX server to run parallel jobs, though we can design the job on a Windows PC. It is not possible to run parallel jobs as server jobs, but server job functionality can be used inside parallel jobs, with parallel execution controlled by the configuration file.

98. how to handle the rejected rows in datastage?

We can handle them by using constraints and storing them in a file or database. We can handle rejected rows in two ways with the help of constraints in a Transformer: 1) by marking the Reject Row cell for the link where we write our constraints in the Transformer properties, or 2) by using REJECTED in the expression editor of the constraint. Create a hash file as temporary storage for rejected rows, create a link and use it as one of the outputs of the Transformer, and apply either of the two steps above on that link. All the rows which are rejected by all the constraints will then go to the hash file.

99. What are orabulk and bcp stages?

These are called plug-in stages. ORABULK is used when we have bulk data for Oracle; for databases other than Oracle we go for the BCP stage. ORABULK is used to load bulk data into a single table of a target Oracle database. BCP is used to load bulk data into a single table in Microsoft SQL Server or Sybase.

100. how is datastage 4.0 functionally different from the enterprise edition now?? what are the exact changes?

There are a lot of changes in DS EE, e.g. the CDC stage, the Stored Procedure stage, etc.

101. How I can convert Server Jobs into Parallel Jobs?


You can't convert server jobs to parallel; you have to rebuild the whole job. There is no mechanism to convert server jobs into parallel jobs; you need to redesign the jobs in the parallel environment using parallel job stages. I have never tried doing this; however, I have some information which will help you save a lot of time: you can convert your server job into a server shared container, and the server shared container can then be used in parallel jobs as a shared container.

102. How much would be the size of the database in DataStage? What is the difference between in-process and inter-process?

Regarding the database, it varies and depends upon the project. For the second question, in-process is where the server transfers only one row at a time to the target, and inter-process means that the server sends groups of rows to the target table; both settings are available on the Tunables tab of the Administrator client.

In-process: You can improve the performance of most DataStage jobs by turning in-process row buffering on and recompiling the job. This allows connected active stages to pass data via buffers rather than row by row.

Note: You cannot use in-process row buffering if your job uses COMMON blocks in transform functions to pass data between stages. This is not recommended practice, and it is advisable to redesign your job to use row buffering rather than COMMON blocks.

Inter-process: Use this if you are running server jobs on an SMP parallel system. This enables the job to run using a separate process for each active stage, which will run simultaneously on separate processors.

Note: You cannot use inter-process row buffering if your job uses COMMON blocks in transform functions to pass data between stages. This is not recommended practice, and it is advisable to redesign your job to use row buffering rather than COMMON blocks.

103. Is it possible to move data from an Oracle warehouse to a SAP warehouse using the DataStage tool?

We can use the DataStage Extract Pack for SAP R/3 and the DataStage Load Pack for SAP BW to transfer the data from Oracle to the SAP warehouse. These plug-in packs are available with DataStage version 7.5.

104. How to implement type2 slowly changing dimensions in data stage? explain with example?

We can handle SCDs in the following ways.

Type 1: Just use "Insert rows Else Update rows" or "Update rows Else Insert rows" in the update action of the target.

Type 2: Use the following steps: a) use a hash file to look up the target; b) take 3 instances of the target; c) give different conditions depending on the process; d) give different update actions in the target; e) use system variables like Sysdate and Null.

To develop SCD Type 2, use the update action "Insert new rows only" in the target. For this you need to maintain the primary key as a composite key in the target table; it is better to use a timestamp column as one of the key columns in the target table.

105. If a DataStage job aborts after say 1000 records, how to continue the job from 1000th record after fixing the error?

If the error is fixed in the job where it failed, the job continues, leaving out the part that already ran. By specifying checkpointing in the job sequence properties, when we restart the sequence it skips what has already completed; this option is available from the 7.5 edition. Note, however, that if checkpoint run is selected it keeps track of failed jobs, and when you start the sequence again it skips the jobs which ran without errors and restarts the failed job - not from the record where it stopped.

106. what is OCI?

If you mean the Oracle Call Interface (OCI), it is a set of low-level APIs used to interact with Oracle databases. It allows one to use operations like logon, execute, parse, etc. from a C or C++ program. Oracle Call Level Interface: Oracle offers a proprietary call interface for C and C++ programmers that allows manipulation of data in an Oracle database. Version 9.n of the Oracle Call Interface (OCI) can connect and process SQL statements in the native Oracle environment without needing an external driver or driver manager. To use the Oracle OCI 9i stage, you need only install the Oracle version 9.n client, which uses SQL*Net to access the Oracle server. Oracle OCI 9i works with both Oracle version 7.0 and 8.0 servers, provided you install the appropriate Oracle 9i software. With Oracle OCI 9i, you can:

- Generate your SQL statement (fully generated or column-generated SQL query).
- Use a file name to contain your SQL statement (user-defined SQL file).
- Clear a table before loading using a TRUNCATE statement (Clear table).
- Choose how often to commit rows to the database (Transaction size).
- Input multiple rows of data in one call to the database (Array size).
- Read multiple rows of data in one call from the database (Array size).
- Specify transaction isolation levels for concurrency control and transaction performance tuning (Transaction Isolation).
- Specify criteria that data must meet before being selected (WHERE clause).
- Specify criteria to sort, summarize, and aggregate data (Other clauses).
- Specify the behavior of parameter marks in SQL statements.

107. What is a hashing algorithm? Explain briefly how it works.

Hashing is key-to-address translation. This means the value of a key is transformed into a disk address by means of an algorithm, usually a relative block and anchor point within the block. How well the algorithms work is closely related to statistical probability. It sounds fancy, but these algorithms are usually quite simple and use division and remainder techniques. Any good book on database systems will have information on these techniques.

It is interesting to note that these approaches are called "Monte Carlo techniques", because the behaviour of the hashing or randomizing algorithms can be simulated by a roulette wheel where the slots represent the blocks and the balls represent the records (on this roulette wheel there are many balls, not just one). A hashing algorithm takes a variable-length data message and creates a fixed-size message digest. When a one-way hashing algorithm is used to generate the message digest, the input cannot be determined from the output. In short, it is a mathematical function, coded into an algorithm, that takes a variable-length string and changes it into a fixed-length string, or hash value.
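As a concrete illustration of the division-and-remainder idea (not DataStage's actual internal algorithm; the bucket count and the way characters are folded into a number are assumptions chosen for clarity), a minimal sketch:

    #include <cstdint>
    #include <iostream>
    #include <string>

    // Map a record key to a group (bucket) number: fold the key's characters
    // into an integer, then take the remainder modulo the number of groups.
    std::uint32_t bucket_for(const std::string& key, std::uint32_t num_groups)
    {
        std::uint64_t folded = 0;
        for (unsigned char c : key)
            folded = folded * 31 + c;          // simple polynomial fold
        return static_cast<std::uint32_t>(folded % num_groups);
    }

    int main()
    {
        // With 7 groups, similar keys still spread across different buckets.
        for (const char* k : {"CUST0001", "CUST0002", "CUST0003"})
            std::cout << k << " -> group " << bucket_for(k, 7) << '\n';
    }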

108. Is it possible to call one job from another job in server jobs?

We cannot call one job directly within another in DataStage; however, we can write a wrapper to run the jobs in a stated sequence, and we can also use a sequencer to sequence a series of jobs. That said, you can attach jobs to another job through its job properties; in fact, you can attach zero or more jobs.

Steps will be Edit --> Job Properties --> Job Control . Click on Add Job and select the desired job.

109. "Will Datastage consider the second constraint in the transformer if the first constraint is satisfied (if linkordering is given)?"

Answer: Yes.

110. What are constraints and derivations? Explain the process of taking a backup in DataStage. What are the different types of lookups available in DataStage?

Constraints are used to check for a condition and filter the data. Example: Cust_Id<>0 is set as a constraint, and only those records meeting it will be processed further. A derivation is a method of deriving fields, for example if you need to compute a SUM, AVG, etc.

In other words, constraints are conditions, and only records meeting them will be processed further (example: process all records where cust_id<>0), while derivations are derived expressions, for example a SUM of salary or an interest-rate calculation.

110. What is a project? Specify its various components?

You always enter DataStage through a DataStage project. When you start a DataStage client you are prompted to connect to a project. Each project contains:

- DataStage jobs.
- Built-in components. These are predefined components used in a job.
- User-defined components. These are customized components created using the DataStage Manager or DataStage Designer.

111. How does DataStage handle the user security?

We have to create users in the Administrator client and give the necessary privileges to those users.

112. What is the meaning of file extender in DataStage server jobs? Can we run one DataStage job from another job? Where is that file data stored, and what is the file extender in DS jobs?

File extender means adding columns or records to an already existing file. In DataStage, we can run one DataStage job from another job.

113. What is the difference between drs and odbc stage

The DRS and ODBC stages are similar in that both can use Open Database Connectivity to connect to a database; performance-wise there is not much of a difference. We use the DRS stage in parallel jobs.

To answer your question, the DRS stage should be faster than the ODBC stage when it uses native database connectivity. You will need to install and configure the required database clients on your DataStage server for it to work.

The Dynamic Relational Stage was leveraged for PeopleSoft so that a job could run on any of the supported databases. It supports ODBC connections too; read more about that in the plug-in documentation. ODBC uses the ODBC driver for a particular database, while DRS is a stage that tries to make it seamless to switch from one database to another; it uses the native connectivity for the chosen target.

114. How do you use rank and update strategy in DataStage?

Don't mix up Informatica with DataStage: in DataStage we don't have such stages. You can achieve the same thing with an ODBC stage by writing the proper SQL queries.

115. What is the max capacity of Hash file in DataStage?

I guess it is a maximum of 2 GB. Take a look at the uvconfig file:

    # 64BIT_FILES - This sets the default mode used to
    #               create static hashed and dynamic files.
    #               A value of 0 results in the creation of 32-bit
    #               files. 32-bit files have a maximum file size of
    #               2 gigabytes. A value of 1 results in the creation
    #               of 64-bit files (ONLY valid on 64-bit capable platforms).
    #               The maximum file size for 64-bit files is system dependent.
    #               The default behavior may be overridden by keywords on
    #               certain commands.
    64BIT_FILES 0

116. What is difference between Merge stage and Join stage?

A Join can have a maximum of two input datasets; a Merge can have more than two input datasets. Differences between the Merge and Join stages:

1. Merge has reject links. 2. Merge can take multiple update links. 3. If you use it for comparison, the first matching data will be the output, because Merge uses the update links to extend the primary details coming from the master link.

Page 22: Data stage.pdf

Someone was saying that Join does not support more than two inputs, while Merge supports two or more inputs (one master and one or more update links). That is incomplete information: Join does support two or more input links (left, right and possibly intermediate links). But yes, if you are talking about a full outer join, then more than two links are not supported.

Coming back to the main question of the difference between the Join and Merge stages, the other significant differences I have noticed are:

1) Number of reject links: Join does not support a reject link; Merge has as many reject links as there are update links (if there are n input links, then 1 is the master link and n-1 are update links).

2) Data selection: with Join there are various ways in which data is selected, e.g. different types of joins - inner, outer (left, right, full), cross join, etc. - so you have different criteria for dropping or selecting a row. With Merge, data in the master record and the update records are merged only when both have the same values for the merge key columns.

117. How can we create a rank in DataStage like in Informatica?

If ranking means output like the following:

    prop_id  rank
    1        1
    1        2
    1        3
    2        1
    2        2

You can do this as follows. First, use a Sort stage with the option that creates the KeyChange column set to true; it makes the data look like this:

    prop_id  rank  KeyChange()
    1        1     1
    1        2     0
    1        3     0
    2        1     1
    2        2     0

If the key value changes, the KeyChange column is set to 1, otherwise 0. After the Sort stage, use a Transformer stage variable to compute the rank: reset it to 1 when KeyChange() is 1, otherwise increment it.
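A minimal C++ sketch of that stage-variable logic (illustrative only; in the real job this is a stage-variable derivation inside the Transformer, and the column names come from the example above):

    #include <iostream>
    #include <vector>

    struct Row { int prop_id; int key_change; };   // output of the Sort stage

    int main()
    {
        // Sorted input with the KeyChange flag, as in the example above.
        std::vector<Row> rows = { {1, 1}, {1, 0}, {1, 0}, {2, 1}, {2, 0} };

        int rank = 0;                              // plays the role of the stage variable
        for (const Row& r : rows) {
            rank = (r.key_change == 1) ? 1 : rank + 1;   // reset on new key, else increment
            std::cout << r.prop_id << "  " << rank << '\n';
        }
    }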

118. what is the difference between validated ok and compiled in datastage.

When you compile a job, DataStage ensures that basic things like the important stage parameters and mappings are correct, and then it creates an executable job.

You validate a compiled job to make sure that all the connections are valid, all the job parameters are set, and a valid output can be expected after running the job. It is like a dry run where you don't actually touch the live data but you are confident that things will work. When we say "validating a job", we are talking about running the job in "check only" mode. The following checks are made:

- Connections are made to the data sources or data warehouse.
- SQL SELECT statements are prepared.
- Files are opened. Intermediate files in Hashed File, UniVerse, or ODBC stages that use the local data source are created, if they do not already exist.

119. What are the environment variables in DataStage? Give some examples.

These are variables used at the project or job level. We can use them to configure the job, i.e. we can associate the configuration file (without this you cannot run a parallel job) or increase the sequential or dataset read/write buffer.

Example: $APT_CONFIG_FILE. There are many more environment variables like this; go to the job properties and click on "Add Environment Variable" to see most of them.

120. What is the purpose of using keys, and what is the difference between surrogate keys and natural keys?

We use keys to provide relationships between the entities (tables). By using primary and foreign key relationships, we can maintain the integrity of the data. The natural key is the one coming from the OLTP system. The surrogate key is an artificial key which we create in the target DW; we can use these surrogate keys instead of the natural key. In SCD Type 2 scenarios surrogate keys play a major role.

In other words, the natural key is the identifier carried over from the source system, while the surrogate key is assigned in the warehouse; you can start a surrogate key at any number, e.g. 100, 10001, 2002.

121. How do you do Usage analysis in datastage ?

1. If you want to know whether a job is part of a sequence, then in the Manager right-click the job and select Usage Analysis. It will show all the job's dependents. 2. You can find how many jobs are using a particular table. 3. You can find how many jobs are using a particular routine.

Like this, you can find all the dependents of a particular object. It is nested: you can move forward and backward and see all the dependents.

122. How to remove duplicates in server job

1) Use a hashed file stage, or 2) if you use the sort command in UNIX (in a before-job subroutine), you can reject duplicate records using the -u parameter, or 3) use a Sort stage.

It also depends on which stages you are using in the server job: if you are using an ODBC stage, you can write a user-defined query in the source stage.

123. Is it possible for two users to access the same job at the same time in DataStage?

No, it is not possible for two users to access the same job at the same time. DS will produce the error "Job is accessed by other user". There is no way around it other than killing the job process.

124. how to find errors in job sequence?

using DataStage Director we can find the errors in job sequence

125. What is job control? How is it used? Explain with steps.

JCL stands for Job Control Language; it is used to run a number of jobs at a time, with or without loops. Steps: click Edit in the menu bar and select 'Job Properties', then enter the parameters (parameter / prompt / type) as: STEP_ID / STEP_ID / string, Source / SRC / string, DSN / DSN / string, Username / unm / string, Password / pwd / string. After editing the above, go to the job control tab, select the jobs from the list box, and run the job.

126. How can we call a routine in a DataStage job? Explain with steps.

Routines are used for implementing business logic. They are of two types: 1) before-subroutines and 2) after-subroutines. Steps: double-click the Transformer stage, right-click on any one of the mapping fields, select the DS Routines option, give the business logic within the edit window, and select either of the options (before/after subroutine). In other words, in the Transformer stage we edit the field and click DS Routines, and it prompts us to select the routine.

127. what are the different types of lookups in datastage?

- Lookup File Set stage: generally used with the Lookup stage. - Hash lookup. - You can also implement a "lookup" using the Merge stage. There are two types: the Lookup stage and the lookup file set. Lookup: a lookup references another stage or database to get data from it and pass it on. Lookup File Set: it allows you to create a lookup file set or reference one for a lookup. The stage can have a single input link or a single output link; the output link must be a reference link. The stage can be configured to execute in parallel or sequential mode when used with an input link. When creating lookup file sets, one file is created for each partition. The individual files are referenced by a single descriptor file, which by convention has the suffix .fs. You can also use the sparse lookup property when you have a large amount of data in the lookup table.

128. Where are the flat files actually stored? What is the path?

Flat files store the data, and the path can be given in the General tab of the Sequential File stage. Normally flat files are stored on FTP servers or in local folders, and .CSV, .XLS and .TXT file formats are common for flat files. If your environment is UNIX, the flat files will be stored on the UNIX box; you need to specify the path in the properties of the Sequential File stage, and you can parameterise the path.

129. how to find the number of rows in a sequential file?

Using Row Count system variable

130. How do you implement a Type 2 slowly changing dimension in DataStage? Give an example.

A slowly changing dimension is a common problem in data warehousing. For example: there exists a customer called Lisa in a company ABC and she lives in New York. Later she moves to Florida. The company must now modify her address. In general there are 3 ways to solve this problem.

Type 1: The new record replaces the original record; there is no trace of the old record at all. Type 2: A new record is added into the customer dimension table; the customer is therefore treated essentially as two different people. Type 3: The original record is modified to reflect the change. In Type 1 the new value overwrites the existing one, which means no history is maintained; the history of where the person stayed previously is lost, but it is simple to use.

In Type 2 a new record is added, so both the original and the new record are present, and the new record gets its own primary key. The advantage of Type 2 is that historical information is maintained, but the size of the dimension table grows, so storage and performance can become a concern. Type 2 should only be used if it is necessary for the data warehouse to track historical changes.

In Type 3 there are two columns, one to indicate the original value and the other to indicate the current value. For example, a new column is added which shows the original address as New York and the current address as Florida. This helps keep some part of the history and the table size is not increased, but one problem is that when the customer moves from Florida to Texas, the New York information is lost. So Type 3 should only be used if the changes will occur only a finite number of times.
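To make the Type 2 behaviour concrete, here is a minimal C++ sketch of what happens to the dimension rows when Lisa's address changes (the column names, surrogate key values and the use of an is_current flag are illustrative assumptions, not a description of a specific DataStage job):

    #include <iostream>
    #include <string>
    #include <vector>

    // One row of the customer dimension (SCD Type 2 style).
    struct CustomerDim {
        int         surrogate_key;   // artificial warehouse key
        std::string natural_key;     // key from the source system
        std::string city;
        bool        is_current;      // only one current row per natural key
    };

    int main()
    {
        std::vector<CustomerDim> dim = { {1, "LISA", "New York", true} };

        // A change arrives from the source: Lisa moved to Florida.
        const std::string changed_key = "LISA", new_city = "Florida";

        // Type 2: expire the old current row and insert a new versioned row.
        for (CustomerDim& row : dim)
            if (row.natural_key == changed_key && row.is_current && row.city != new_city)
                row.is_current = false;

        dim.push_back({2, changed_key, new_city, true});

        for (const CustomerDim& row : dim)
            std::cout << row.surrogate_key << " " << row.natural_key << " " << row.city
                      << " " << (row.is_current ? "current" : "history") << '\n';
    }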

131. what is the purpose of exception activity in data stage 7.5?

It is used to catch exceptions raised while running the job. The activities following the Exception Handler will be executed whenever an unknown error occurs while running the job sequence.

132. What is the difference between sequential file and a dataset? When to use the copy stage?


A sequential file stores a smaller amount of data, as plain text with any extension such as .txt, whereas a data set stores a huge amount of data and opens only with the .ds extension.

The Sequential File stage is used for smaller volumes with any file extension, whereas the Data Set stage is used to store huge amounts of data and opens only with the .ds extension. The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop or change the order of columns.

The main difference between a sequential file and a data set is that a sequential file stores data as ordinary text, whereas a data set stores the data in DataStage's internal format.

Sequential file: it acts as a source and as permanent storage for a target; a typical extension is .txt.

Dataset: it acts as a temporary storage stage, mainly used before the target stage. When using this stage the input data is partitioned and converted into the internal data set format, which makes it easy to load the data into the target stage.

Copy: it acts as a placeholder; it has a single input and many outputs.

If you want to add a new stage to your job later, having a Copy stage in place makes it very easy; otherwise you would have to modify the whole job.

133. Where do we use the Link Partitioner in a DataStage job? Explain with an example.

We use the Link Partitioner in DataStage server jobs. The Link Partitioner stage is an active stage which takes one input and allows you to distribute partitioned rows to up to 64 output links.

134. how to kill the job in data stage?

By killing the respective process ID. You should use kill -14 so the job ends nicely; sometimes using -9 leaves things in a bad state. You can also do it from the DataStage Director via Cleanup Resources.

135. How do you parameterise a field in a sequential file? I am using DataStage as the ETL tool, with a sequential file as the source.

We cannot parameterize a particular field in a sequential file; instead we can parameterize the source file name of the sequential file.

#FILENAME#

136. How do you drop the index before loading data into the target, and how do you rebuild it, in DataStage?

This can be achieved with the "Direct Load" option of the SQL*Loader utility.

137. If the size of the Hash file exceeds 2GB..What happens? Does it overwrite the current rows

it overwrites the file

138. Other than Round Robin, What is the algorithm used in link collecter? Also Explain How it will works?

Other than round robin, the other algorithm is Sort/Merge. Using the sort/merge method, the stage reads multiple sorted inputs and writes one sorted output.

139. how to improve the performance of hash file?

You can improve the performance of a hashed file by:

1. Preloading the hash file into memory - this can be done by enabling the preload options in the hash file output stage.

2. Write caching options - these make data be written into a cache before being flushed to disk. You can enable this to ensure that hash file rows are written onto the cache in order and flushed to disk together, instead of in the order in which individual rows are written.

3. Preallocating - estimating the approximate size of the hash file so that the file does not need to be split too often after write operations.

140. what is the size of the flat file?

The flat file size depends on the amount of data contained in that flat file.

141. What is the DataStage engine? What is its purpose?

The DataStage server contains the DataStage engine. The DS server interacts with the client components and the repository. The DS engine is needed to develop and run jobs: only when the engine is running can we develop jobs.

142. What is the difference between symmetric multiprocessing and massively parallel processing?

Symmetric Multiprocessing (SMP) - Some hardware resources may be shared by the processors. The processors communicate via shared memory and have a single operating system.

Cluster or Massively Parallel Processing (MPP) - Known as shared-nothing, in which each processor has exclusive access to its hardware resources. Cluster systems can be physically dispersed; the processors have their own operating system and communicate via a high-speed network.


Symmetric Multiprocessing (SMP) is the processing of programs by multiple processors that share a common operating system and memory. SMP is also called "tightly coupled multiprocessing": a single copy of the operating system is in charge of all the processors in an SMP system, and this methodology usually doesn't exceed 16 processors. SMP is better than MPP when online transaction processing is done, where many users access the same database with a relatively simple set of common transactions. One main advantage of SMP is its ability to dynamically balance the workload among processors (and, as a result, serve more users at a faster rate).

Massively Parallel Processing (MPP) is the processing of programs by multiple processors that work on different parts of the program and have separate operating systems and memories. The different processors communicate with each other through message interfaces. There are cases with up to 200 processors running a single application. An interconnect arrangement of data paths allows messages to be sent between the processors running a single application or product. The setup for MPP is more complicated than SMP: experience and in-depth knowledge are needed to partition the database among the processors and to assign the work to them. An MPP system can also be called a loosely coupled system. MPP is considered better than SMP for applications that allow a number of databases to be searched in parallel.

143. Give one real-time situation where the Link Partitioner stage is used.

If we want to send data from the source to the targets quickly, we use the Link Partitioner stage in server jobs; we can make a maximum of 64 partitions, and it is an active stage. Normally we can't connect two active stages, but this stage is an accepted exception and can connect to a Transformer or Aggregator stage. The data sent from the Link Partitioner is collected by the Link Collector, again with a maximum of 64 partitions. The Link Collector is also an active stage, so to avoid an active-to-active connection from the Transformer to the Link Collector we use an Inter-Process Communication (IPC) stage. As IPC is a passive stage, the data can then be collected by the Link Collector. We can use inter-process communication only when the target is a passive stage.

144. What does separation option in static hash-file mean?

The different hashing algorithms are designed to distribute records evenly among the groups of the file based on characters and their position in the record ids. When a hashed file is created, Separation and Modulo respectively specify the group buffer size and the number of groups allocated for the file. When a static hash file is created, DataStage creates a file that contains the number of groups specified by the modulo.

Size of Hashfile = modulus(no. groups) * Separations (buffer size)

145. How do you remove duplicates without using remove duplicate stage?

In the target, make the column a key column and run the job. Or use a Sort stage and set the property ALLOW DUPLICATES to false. Or simply hash-partition the input data and check the Sort and Unique options.

146. How do you call procedures in datastage?

Use the Stored Procedure Stage

147. How can we create read-only jobs in DataStage?

In the export dialog there is an option: click on the Options tab, and under the Include section you will find "Read Only"; just enable that.

148. How do you run a job from the command prompt in UNIX?

Using the dsjob command with options, e.g.: dsjob -run -jobstatus projectname jobname

What is the difference between a Transform and a Routine in DataStage?

A transform converts the data from one form to another form, whereas a routine describes the business logic.

149. how do u clean the datastage repository.

Remove log files periodically, and clear the &PH& directory with: CLEAR.FILE &PH&

150. What are the transaction size and array size in the OCI stage? How are they used?

Transaction Size - This field exists for backward compatibility, but it is ignored for release 3.0 and later of the plug-in. The transaction size for new jobs is now handled by Rows per transaction on the Transaction Handling tab of the Input page.

Rows per transaction - The number of rows written before a commit is executed for the transaction. The default value is 0, that is, all the rows are written before being committed to the data table.

Array Size - The number of rows written to or read from the database at a time. The default value is 1, that is, each row is written in a separate statement.
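A rough sketch of how the two settings interact, using hypothetical stand-in functions (execute_batch and commit are assumptions, not the OCI API); the point is only that Array Size controls how many rows go to the database per call, while Rows per transaction controls how many rows are written between commits:

    #include <iostream>
    #include <vector>

    // Hypothetical stand-ins for the calls the stage would make.
    void execute_batch(const std::vector<int>& rows) {
        std::cout << "  sent " << rows.size() << " row(s) in one call\n";
    }
    void commit() { std::cout << "  COMMIT\n"; }

    int main()
    {
        const int total_rows   = 10;
        const int array_size   = 3;   // rows per database call
        const int rows_per_txn = 6;   // rows per commit (0 would mean commit only at the end)

        std::vector<int> batch;
        int since_commit = 0;
        for (int row = 1; row <= total_rows; ++row) {
            batch.push_back(row);
            if (static_cast<int>(batch.size()) == array_size) {   // Array Size reached
                execute_batch(batch);
                since_commit += array_size;
                batch.clear();
            }
            if (since_commit >= rows_per_txn) {                   // Rows per transaction reached
                commit();
                since_commit = 0;
            }
        }
        if (!batch.empty()) execute_batch(batch);                 // flush the remainder
        commit();                                                 // final commit
    }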

151. How to know the no.of records in a sequential file before running a server job?

if your environment is unix , you can check with wc -l filename command.

152. My requirement is like this. Here is the codification suggested:
SALE_HEADER_XXXXX_YYYYMMDD.PSV
SALE_LINE_XXXXX_YYYYMMDD.PSV
XXXXX = LVM sequence to ensure unicity and continuity of file exchanges (caution, there will be an increment to implement).
YYYYMMDD = LVM date of file creation.
Compression and delivery to: SALE_HEADER_XXXXX_YYYYMMDD.ZIP and SALE_LINE_XXXXX_YYYYMMDD.ZIP.


If we run that job, the target file names are like sale_header_1_20060206 and sale_line_1_20060206. If we run it again, the target files should be sale_header_2_20060206 and sale_line_2_20060206. If we run the same job the next day, the target files should be sale_header_3_20060306 and sale_line_3_20060306. That is, whenever we run the same job, the target file name should automatically change to filename_(previous number + 1)_currentdate.

This can be done by using unix script

1. Keep the target filename as a constant name, xxx.psv.
2. Once the job completes, invoke the Unix script through an after-job routine (ExecSH).
3. The script should get the number used in the previous file and increment it by 1. After that, move the file from xxx.psv to filename_(previousnumber + 1)_currentdate.psv and then delete the xxx.psv file. This is the easiest way to implement it.
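For illustration only, a C++ sketch of the rename logic the script needs (the helper name next_name and the fixed prefix are assumptions; in practice this logic would live in the Unix script invoked from the after-job routine as described above):

    #include <cstdio>
    #include <ctime>
    #include <string>

    // Build the next target file name from the previous sequence number
    // and today's date in YYYYMMDD form.
    std::string next_name(const std::string& prefix, int previous_number)
    {
        std::time_t now = std::time(nullptr);
        char date[9];
        std::strftime(date, sizeof date, "%Y%m%d", std::localtime(&now));

        char buf[128];
        std::snprintf(buf, sizeof buf, "%s_%d_%s.psv",
                      prefix.c_str(), previous_number + 1, date);
        return buf;
    }

    int main()
    {
        // If the previous run produced sale_header_2_20060206.psv,
        // the next run renames xxx.psv to sale_header_3_<today>.psv.
        std::printf("%s\n", next_name("sale_header", 2).c_str());
    }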

154. How do you distinguish the surrogate keys in different dimension tables? How can we assign them for different dimension tables?

Use a database sequence to make generating the surrogate key easier.

155. How do you find the process id? Explain with steps.

You can find it in UNIX by using the ps -ef command; it displays all the processes currently running on the system along with their process ids. You can also use the DS Director: follow the path Job > Cleanup Resources.

There you can also see the PID, and it displays all the currently running processes. Depending on your environment, you may have lots of process ids. From one of the DataStage docs, you can try this on any given node: $ ps -ef | grep dsuser, where dsuser is the account for DataStage. If the ps command doesn't make sense, you'll need some background theory about how processes work in UNIX (or the MKS environment when running on Windows). Also from the DataStage docs: APT_PM_SHOW_PIDS - if this variable is set, players will output an informational message upon startup, displaying their process id.

You can also use the DataStage Administrator: just click on the project, use the execute-command option, and follow the menu choices to get the job name and PID; then kill the process in UNIX, but for this you will require the DataStage user name under which the process is locked.

155. What are QualityStage and ProfileStage?

QualityStage is used for data cleansing; ProfileStage is used for data profiling.

156. How can I schedule the cleaning of the file &PH& by dsjob?

Create a job with a dummy transformer and a sequential file stage. In a before-job subroutine, use ExecTCL to execute the following command: CLEAR.FILE &PH&

157. If we are using two sources having the same metadata, how do we check whether the data in the two sources is the same or not? And if the data is not the same, I want to abort the job. How can we do this?

Use a Change Capture stage and output it into a Transformer. Write a routine to abort the job, invoked from the Transformer, e.g. when @INROWNUM = 1. So if the data does not match, a row is passed into the Transformer and the job is aborted.

158. What is the difference between ETL and ELT?

ETL usually scrubs the data and then loads it into the data mart or data warehouse, whereas ELT loads the data and then uses the RDBMS to scrub it and reload it into the data mart or data warehouse. ETL = Extract >>> Transform >>> Load; ELT = Extract >>> Load >>> Transform. In ETL the transformation takes place in the staging area, and in ELT the transformation takes place on either the source side or the target side.


158. Can you tell me for what purpose .dsx files are used in DataStage?

.dsx is the standard file extension for exported DataStage jobs. Whenever we export a job or a sequence, the file is exported in the .dsx format. A standard usage is that we develop the job in our test environment, and after testing we export the file and save it as x.dsx; this can be done using DataStage Manager. You can also export DataStage jobs in .xml format.

159. What are environment variables? What are they used for?

Basically an environment variable is a predefined variable that we can use while creating a DS job. We can set it either at project level or at job level; once we set a specific variable, that variable is available to the project/job. We can also define new environment variables; for that we can go to DS Administrator. For further details refer to the DS Administrator guide.

159. How do you write and execute routines for PX jobs in C++?

You define and store the routines in the DataStage repository (e.g. in the Routines folder), and these routines are executed via a C++ compiler. You have to write the routine in C++ (g++ on Unix), then create an object file, and provide that object file path in your routine definition.

160. I have a few questions. 1. What are the various processes which start when the DataStage engine starts? 2. What changes need to be done on the database side if I have to use the DB2 stage? 3. Is the DataStage engine responsible for compilation, execution, or both?

Three processes start when the DataStage engine starts: 1. DSRPC, 2. DataStage Engine Resources, 3. DataStage Telnet Services.

161. What is the difference between reference link and straight link ?

The difference between a reference link and a straight link is:

A straight link is one where data is passed directly to the next stage, and a reference link is one which shows that the stage has a reference (a reference key) to a main table; for example, in Oracle the EMP table has a reference to the DEPT table.

In DataStage, you might have two table stages as sources (one on a straight link and the other on a reference link) feeding one Transformer stage; with two file-stage sources, the straight link feeds the Transformer while the reference comes from a hash file used as the reference lookup.

162. What is Runtime Column Propagation and how to use it?

If your job has more columns than are defined in the metadata, and runtime column propagation is enabled, it will propagate those extra columns through the rest of the job.

163. Can both the source system (Oracle, SQL Server, etc.) and the target data warehouse (which may be Oracle, SQL Server, etc.) be on a Windows environment, or should one of the systems be on a UNIX/Linux environment?

Your source system can be Oracle, SQL Server, DB2, flat files, etc., but your target system for the complete data warehouse should be one of them (Oracle, SQL Server, DB2, ...). In the server edition you can have both on Windows, but with PX the target should be on UNIX.

164. Is there any difference b/n Ascential DataStage and DataStage.

There is no difference between Ascential DataStage and DataStage. It is now IBM WebSphere DataStage; earlier it was Ascential DataStage, and IBM bought it and renamed it as above.

165. how can we test the jobs?


Testing of jobs can be performed at many different levels: unit testing, SIT and UAT phases. Testing basically involves functionality and performance tests. First, data for the job needs to be created to test the functionality; by changing the data we see whether the requirements are met by the existing code. Every iteration of code change should be accompanied by a testing iteration.

Performance tests basically involve load tests and checking how well the existing code performs in a finite period of time. Performance tuning can be applied to the SQL, the job design, or the BASIC/OSH code for faster processing times. In addition, all job designs should include error correction and failover support so that the code is robust.

166. What is the use of a hash file? Instead of a hash file, why can't we use a sequential file itself?

A hash file is used to eliminate duplicate rows based on the hash key, and is also used for lookups; DataStage does not allow a sequential file to be used as a lookup. The primary use of the hash file is to do a lookup. You could use a sequential file for a lookup, but you would need to write your own routine to match the columns, so coding time and execution time would be more expensive. When you generate a hash file, the file indexes the key with a built-in hashing algorithm, so when a lookup is made it is much, much faster; it also eliminates duplicate rows. These files can be held in memory, hence faster performance than from a sequential file.

167. what is a routine?

Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them. The following are the different types of routines: 1) transform functions, 2) before/after job subroutines, 3) job control routines. A routine is a user-defined function that can be reused within the project.

168. What is the difference between static hash files and dynamic hash files?

Static hash files don't change their number of groups (modulus) except through manual resizing. Dynamic hash files automatically change their number of groups (modulus) in response to the amount of data stored in the file.

169. How can we create environment variables in DataStage?

We can create environment variables by using the DataStage Administrator; this mostly comes under the Administrator's part. As a designer we can also add one directly: Designer > View > Job Properties > Parameters > Add Environment Variable > User Defined > Add.

170. How can we load a source into an ODS?

What is your source? Depending on the type of source, you have to use the respective stage, e.g. Oracle Enterprise for an Oracle source or target, and similarly for other sources.

171. How do you eliminate duplicate rows in DataStage?

You can remove duplicate rows in more than one way.

1. In DS there is a stage called "Remove Duplicates" where you can specify the key.

2. Alternatively, you can specify the key on the stage itself, and the stage removes the duplicate rows based on the key at processing time.

By using the Hashed File stage in DS server jobs we can eliminate duplicates. Using a Sort stage, set the property ALLOW DUPLICATES to false, or on any stage's Input tab choose hash partitioning, specify the key and check the Unique checkbox. If you are working with server jobs, you can use a hash file to eliminate duplicate rows.

There are two methods for eliminating duplicate rows in DataStage:

1. Using the hash file stage (specify the keys; a unique key does not allow duplicate values).

2. Using a Sort stage followed by the Remove Duplicates stage.

172. What is the Pivot stage? Why is it used, and for what purpose?

The Pivot stage is used to turn columns into rows.

The Pivot stage supports only horizontal pivoting - columns into rows. It doesn't support vertical pivoting - rows into columns.

Example: in the source table below there are two columns for the quarterly sales of a product, but the business requirement is that the target should contain a single column to represent quarterly sales. We can achieve this using the Pivot stage, i.e. horizontal pivoting.

Source Table:

    ProdID  Q1_Sales  Q2_Sales
    1010    123450    234550

Target Table:

    ProdID  Quarter_Sales  Quarter
    1010    123450         Q1
    1010    234550         Q2
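A tiny C++ sketch of the horizontal pivot shown above (illustrative only; in the actual job this mapping is configured in the Pivot stage, not hand-coded):

    #include <iostream>
    #include <vector>

    struct SourceRow { int prod_id; long q1_sales; long q2_sales; };
    struct TargetRow { int prod_id; long quarter_sales; const char* quarter; };

    int main()
    {
        std::vector<SourceRow> source = { {1010, 123450, 234550} };

        // Horizontal pivot: each quarterly column becomes its own output row.
        std::vector<TargetRow> target;
        for (const SourceRow& s : source) {
            target.push_back({s.prod_id, s.q1_sales, "Q1"});
            target.push_back({s.prod_id, s.q2_sales, "Q2"});
        }

        for (const TargetRow& t : target)
            std::cout << t.prod_id << "  " << t.quarter_sales << "  " << t.quarter << '\n';
    }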

173. What is the Complex Flat File stage? In which situation do we use it?

The CFF stage is used to read files in EBCDIC format, mainly mainframe files with REDEFINES. A complex flat file can be used to read the data at the initial level. Using CFF, we can read ASCII or EBCDIC data, select the required columns and omit the rest. We can collect the rejects (badly formatted records) by setting the reject property to "save" (other options: continue, fail). We can also flatten arrays (COBOL files).

174. What are the main differences between a server job and a parallel job in DataStage?

In server jobs we have fewer stages; they are mainly logic intensive, we use the Transformer for most things, and they do not use MPP systems. In parallel jobs we have many stages, they are stage intensive, with a built-in stage for each particular task, and they use MPP systems.

In server jobs we don't have an option to process the data on multiple nodes as in parallel jobs. In parallel jobs we have the advantage of processing the data in pipelines and by partitioning, whereas we don't have any such concept in server jobs.

There are also a lot of differences in using the same stages in server and parallel jobs. For example, in parallel a sequential file (or any other file stage) can have either an input link or an output link, but in server it can have both (and more than one at that). Server jobs compile and run within the DataStage server, but parallel jobs compile and run on the DataStage UNIX server.

A server job extracts all the rows from the source to the next stage; only then does that stage activate and pass the rows on to the target level or DWH, which is time consuming. In parallel jobs there are two types of parallelism: 1. pipeline parallelism and 2. partition parallelism.

1. With pipeline parallelism, rows can be passed from the source to the next stage while the source is still being read; the stage activates and passes rows to the target level or DWH at the same time. It maintains only one node between source and target.

2. Partition parallelism maintains more than one node between source and target.

175. Differentiate between pipeline and partition parallelism.


Pipeline parallelism: consider three CPUs connected in series. When data is fed into the first one it starts processing, and simultaneously data is being transferred into the second CPU, and so on. You can compare it with three sections of a pipe: as water enters the pipe it starts moving through all the sections of the pipe.

Partition parallelism: consider three CPUs connected in parallel and fed with data at the same time, thus reducing the load and improving efficiency. You can compare it to a single big pipe containing three inner pipes: as water is fed to them, a large quantity is consumed in less time.

176. How do we read data from Excel files? My problem is that my data file has commas in the data, but the delimiter we are using is |. How do we read the data? Explain with steps.

1. Create a DSN for your Excel file by picking the Microsoft Excel driver. 2. Take ODBC as the source stage. 3. Configure ODBC with the DSN details. 4. While importing metadata for the Excel sheet, make sure you select the system tables check box.

Note: In XL sheet the first line should be column names.

If the problem is only commas in the Excel file data, we can open it in Access and save the file with a pipe (|) separator; then it can be used as a simple sequential file, changing the delimiter to | in the Format tab.

177. Disadvantages of staging area

The disadvantage of a staging area is disk space, as we have to dump data into a local area. To my knowledge, there is no other disadvantage of a staging area.

178. What is the meaning of performance tuning technique? Give an example.

Performance tuning means taking some action to increase the performance of a slowly running job, for example:

1) use the Link Partitioner and Link Collector to speed up performance; 2) use sorted data for aggregation; 3) use a sorter at the source side and aggregation at the target side; 4) tune the OCI stage's 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects; 5) do not use an IPC stage at the target side.

Is this only related to server jobs? Because in parallel extender these things are taken care of by the stages.

179. How do you distinguish the surrogate key in different dimension tables?

the Surrogate key will be the key field in the dimensions.

180. How do you read data from Excel files? Explain with steps.

Reading data from an Excel file: save the file as .csv (comma separated values); use a flat file stage on the DataStage job canvas; double-click on the flat file stage and assign the input file to the .csv file you stored; import metadata for the file (once you have imported or typed the metadata, click View Data to check the data values); then do the rest of the transformation as needed.

Alternatively, create a new DSN for the Excel driver and choose the workbook from which you want data; then select the ODBC stage and access the Excel data through it, i.e. import the Excel sheet using the new DSN created for the Excel file.

181. how can we generate a surrogate key in server/parallel jobs?

In parallel jobs we can use the Surrogate Key Generator stage. In server jobs we can use a built-in routine called KeyMgtGetNextValue. You can also generate the surrogate key in the database using a sequence generator.

182. what is an environment variable??


Basically an environment variable is a predefined variable that we can use while creating a DS job. We can set it either at project level or at job level; once we set a specific variable, it is available to the project/job. We can also define new environment variables; for that we go to DS Administrator. These are variables used at the project or job level; we can use them to configure the job, i.e. associate the configuration file (without it you cannot run a parallel job) or increase the sequential or dataset read/write buffer.

ex: $APT_CONFIG_FILE

Like the above, we have many environment variables. Go to the job properties, click on the Parameters tab, and then click on "Add Environment Variable" to see most of the environment variables.

183. what are different types of file formats??

Comma-delimited .csv files, tab-delimited text files, and .dsx files (the standard export format of DataStage).

184. For what purpose is the stage variable mainly used?

A stage variable is a temporary storage variable in memory; if we are doing calculations repeatedly, we can store the result in a stage variable. Stage variables can be used in situations where you want to store a previous record's value in a variable, compare it with the current record's value, and use if-then-else conditional statements. For example, if you want to show the product list, comma separated, for each manufacturer from the following rows, you can use stage variables:

Input Rows

Manufacturer Product

GM Chevvy

GM GeoMetro

Ford Focus

Ford Explorer

Chrysler Jeep

Chrysler Pacifica

Output rows

GM Chevvy,GeoMetro

Ford Focus,Explorer

Chrysler Jeep,Pacifica
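
A minimal sketch of the Transformer stage variables for this example, assuming the input is sorted by Manufacturer, a hypothetical input link named InLink, and that the variables are declared in this order (stage variables are evaluated top to bottom for every row):

* svProductList - accumulates the comma-separated product list for the current manufacturer
svProductList: If InLink.Manufacturer = svPrevManu Then svProductList : "," : InLink.Product Else InLink.Product
* svPrevManu - remembers the manufacturer of the previous row for the comparison above
svPrevManu: InLink.Manufacturer

The output column derivation is then svProductList, and only the last row per manufacturer is kept downstream (for example with an aggregation or remove-duplicates step keeping the last row for each Manufacturer).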

185. Where can you output data using the Peek Stage?

In the DataStage Director - look at the DataStage Director log.

186. How can we improve the job performance?

We can improve performance in many ways; one simple method in server jobs is to insert an IPC stage between two active stages. There are lots of performance tuning techniques. Some of the tips to follow to improve performance in DataStage parallel jobs are:

1. Do the right partitioning at the right parts of the job; avoid re-partitioning data as much as possible.

2. Sort the data before aggregating or removing duplicates.

3. Use Transformer and Pivot stages sparingly.

4. Try to develop small simple jobs, rather than huge complex ones.

5. Study and decide in which circumstances a join or merge can be used and in which a lookup can be used.

Instead of having one job with a fork-join design, create two jobs; these two jobs will perform better than the single job in most cases. Even if you do want to use a fork-join style, use proper partitioning and sorting for the stages.

In server jobs we can use a hashed file for lookups; this indexes the input data based on the key column(s) we define and thus improves performance. The array size can also be increased in the final database table stage.

187. What are the two types of hashed files?

The two types of hashed files are static hashed files and dynamic hashed files. The most commonly used hashed file is the Type 30 dynamic hashed file, and hashed files are used in server jobs. Dynamic hashed files are further subdivided into Generic and Specific.

188. What is a job sequence used for? What are batches? What is the difference between a job sequence and a batch?

A job sequence allows you to specify a sequence of server or parallel jobs to run. The sequence can also contain control information; for example, you can specify different courses of action to take depending on whether a job in the sequence succeeds or fails. Once you have defined a job sequence, it can be scheduled and run using the DataStage Director. It appears in the DataStage Repository and in the DataStage Director client as a job.

A batch is a collection of jobs grouped together to perform a specific task, i.e. it is a special type of job created using the DataStage Director which can be scheduled to run at a specific time.

Difference between sequences and batches: unlike in sequences, in batches we cannot provide control information.

189. What is integration and unit testing in DataStage?

Unit testing: in a DataStage scenario, unit testing is the technique of testing an individual DataStage job for its functionality.

Integration testing: when two or more jobs are tested collectively for their functionality, that is called integration testing.

190. How can we improve performance in the Aggregator stage?

To improve performance when using the Aggregator stage, sort the data before passing it to the stage. Select the most appropriate partitioning method based on data analysis; hash partitioning performs well in most cases.

191. Why is a hashed file faster than a sequential file or an ODBC stage?

A hashed file is indexed and works on a hashing algorithm, which is why key-based searches are faster in a hashed file.

192. Is it possible to query a hash file? Justify your answer...

No, it is not possible to query a hashed file directly, the reason being that it is a back-end file and not a database that can be queried.

193. What does # indicate in environment variables?

It is used to identify a job parameter reference; parameters are enclosed in # characters (e.g. #ParameterName#) when used in stage properties.

194. What is the User Variables activity in DataStage?

The User Variables activity stage defines variables that can be used later in the sequence.

195. What is an alternative way in which we can do job control?

Job control is also possible through scripting; how it is controlled depends on the requirements and the needs of the job.

196. What is the use of job control?

Job control is used for scripting. With the help of scripting (the job-control routines in DataStage BASIC), we can set parameters for a called job, execute it, do error handling and other such tasks; a sketch is given below.
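
A minimal sketch of a job-control routine in DataStage BASIC, assuming a hypothetical called job "LoadCustomers" with a parameter SRC_DIR (DSAttachJob, DSSetParam, DSRunJob and DSWaitForJob are the standard job-control API calls):

* Attach the job, aborting this routine on error
hJob = DSAttachJob("LoadCustomers", DSJ.ERRFATAL)
* Set a job parameter, then run the job and wait for it to finish
ErrCode = DSSetParam(hJob, "SRC_DIR", "/data/in")
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hJob)
* Check the finishing status and log a fatal message (which aborts the controlling job) on failure
Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
If Status = DSJS.RUNFAILED Or Status = DSJS.CRASHED Then
   Call DSLogFatal("LoadCustomers failed", "JobControl")
End
ErrCode = DSDetachJob(hJob)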

197. What are the different types of star schema?

A multi-star schema, also known as a galaxy schema, is one variant of the star schema.

198. What is the Sequencer stage?

Say there are two jobs (J1 and J2) on the input links and one job (J3) on the output link of a Sequencer stage in a DataStage job sequence. The sequencer can be set to "ALL" or "ANY". If it is set to "ALL", the sequencer triggers the third job (J3) only after both input-linked jobs (J1 and J2) complete; if it is set to "ANY", it waits for either job (J1 or J2) to complete and then triggers the third one.

199. What is the use of tunables?

Tunables is a tab in DataStage Administrator through which one can increase or decrease the cache size. It is a project property; there we can change the value of the cache size, i.e. between 0 and 999 MB.

200. Which partitioning should be used for the Aggregator stage in parallel jobs?

By default this stage uses the Auto partitioning mode. The best partitioning depends on the operating mode of this stage and of the preceding stage. If the Aggregator is operating in sequential mode, it first collects the data using the default Auto collection method before writing it. If the Aggregator is in parallel mode, we can choose any partitioning type from the drop-down list on the Partitioning tab; generally Auto or Hash is used.

201. What is fact loading, and how do you do it?

First run the hashed file (lookup) jobs, then the dimension jobs, and finally the fact jobs.

202. What is the difference between the Merge stage and the Lookup stage?

Merge stage: the parallel job stage that combines data sets.

Lookup stage: the mainframe processing and parallel active stage that performs table lookups.

In more detail, Lookup stage: 1. Used to perform lookups. 2. Multiple reference links, a single input (primary) link, a single output link and a single reject link. 3. Large memory usage, because the reference data is held in memory and paging may be required. 4. Data on the input and reference links need NOT be sorted.

Merge stage: 1. Combines a sorted master data set with sorted update data sets. 2. Several reject links (one per update link) and a single output link. 3. Less memory usage. 4. Data needs to be sorted.

203. How to run a job using command line?

dsjob -run -jobstatus projectname jobname

204. Suppose you have a table "sample" with three columns.

sample:

Cola Colb Colc
1    10   100
2    20   200
3    30   300

Assume cola is the primary key. How will you fetch the record with the maximum cola value into the target system using the DataStage tool?

As the question makes clear, the source data is in a table, so you can use an OCI stage to read it. In the OCI stage, write a user-defined SQL query that selects the maximum cola value from the table, and then load that data into the target table; a sketch of such a query, returning the whole record, is given below.
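
A minimal sketch of the user-defined SQL, assuming the source table really is named sample; the subquery finds the maximum key and the outer query returns the full record:

SELECT cola, colb, colc
FROM   sample
WHERE  cola = (SELECT MAX(cola) FROM sample);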

205. To run the job through command line

Given below is the syntax for running DataStage jobs through the command line (a concrete example follows at the end).

Command syntax:

dsjob [-file <file> <server> | [-server <server>] [-user <user>] [-password <password>]] <primary command> [<arguments>]

Valid primary command options are: -run -stop -lprojects -ljobs -linvocations -lstages -llinks -projectinfo -jobinfo -stageinfo -linkinfo -lparams -paraminfo -log -logsum -logdetail -lognewest -report -jobid -import

Status code = -9999 DSJE_DSJOB_ERROR

dsjob -run [-mode <NORMAL | RESET | VALIDATE>] [-param <name>=<value>] [-warn <n>] [-rows <n>] [-wait] [-opmetadata <TRUE | FALSE>] [-disableprjhandler] [-disablejobhandler] [-jobstatus] [-userstatus] [-local] [-useid] <project> <job|jobid>

Status code = -9999 DSJE_DSJOB_ERROR
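
A minimal concrete invocation, assuming a hypothetical DataStage server dshost, project dstage_proj, job LoadCustomers and parameter SRC_DIR (the -jobstatus option makes dsjob wait for the job to finish and return its finishing status as the exit code):

dsjob -server dshost -user dsadm -password secret -run -mode NORMAL -param SRC_DIR=/data/in -jobstatus dstage_proj LoadCustomers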
