Informatica PowerCenter: Parameters, Variables, and Real-Time Session Configuration



Using Parameters, Variables and Parameter Files

Challenge

Understanding how parameters, variables, and parameter files work and using them for maximum efficiency.

Description

Prior to the release of PowerCenter 5, the only variables inherent to the product were those defined within specific transformations and the server variables that were global in nature. Transformation variables were defined as variable ports in a transformation and could only be used in that specific transformation object (e.g., Expression, Aggregator, and Rank transformations). Similarly, global parameters defined within Server Manager would affect the subdirectories for source files, target files, log files, and so forth.

More current versions of PowerCenter make variables and parameters available across the entire mapping rather than for a specific transformation object. In addition, they provide built-in parameters for use within Workflow Manager. Using parameter files, these values can change from session run to session run. With the addition of workflows, parameters can now be passed to every session contained in the workflow, providing more flexibility and reducing parameter file maintenance. Other important functionality added in recent releases is the ability to dynamically create parameter files that can be used in the next session in a workflow or in other workflows.

    Parameters and Variables

    Use a parameter file to define the values for parameters and variables used in a workflow, worklet, mapping, or session. A parameter file can be created using a text editor such as WordPad or Notepad. List the parameters or variables and their values in the parameter file. Parameter files can contain the following types of parameters and variables:

Workflow variables
Worklet variables
Session parameters
Mapping parameters and variables

When using parameters or variables in a workflow, worklet, mapping, or session, the Integration Service checks the parameter file to determine the start value of the parameter or variable. Use a parameter file to initialize workflow variables, worklet variables, mapping parameters, and mapping variables. If start values are not defined for these parameters and variables, the Integration Service checks for the start value of the parameter or variable in other places.

Session parameters must be defined in a parameter file. Because session parameters do not have default values, if the Integration Service cannot locate the value of a session parameter in the parameter file, it fails to initialize the session. To include parameter or variable information for more than one workflow, worklet, or session in a single parameter file, create separate sections for each object within the parameter file. Also, create multiple parameter files for a single workflow, worklet, or session and change the file that these tasks use, as necessary. To specify the parameter file that the Integration Service uses with a workflow, worklet, or session, do either of the following:

Enter the parameter file name and directory in the workflow, worklet, or session properties.
Start the workflow, worklet, or session using pmcmd and enter the parameter file name and directory in the command line.

If a parameter file name and directory are entered both in the workflow, worklet, or session properties and in the pmcmd command line, the Integration Service uses the information entered in the pmcmd command line.

    Parameter File Format

    When entering values in a parameter file, precede the entries with a heading that identifies the workflow, worklet or session whose parameters and variables are to be assigned. Assign individual parameters and variables directly below this heading, entering each parameter or variable on a new line. List parameters and variables in any order for each task.

    The following heading formats can be defined:

Workflow variables - [folder name.WF:workflow name]
Worklet variables - [folder name.WF:workflow name.WT:worklet name]
Worklet variables in nested worklets - [folder name.WF:workflow name.WT:worklet name.WT:worklet name...]
Session parameters, plus mapping parameters and variables - [folder name.WF:workflow name.ST:session name] or [folder name.session name] or [session name]

    Below each heading, define parameter and variable values as follows:

parameter name=value
parameter2 name=value
variable name=value
variable2 name=value
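To make the layout concrete, a sketch of a parameter file that combines a workflow-level section and a session-level section might look like the following (folder, workflow, session, and parameter names are hypothetical):

[Finance.WF:wf_Nightly_Load]
$$LoadType=FULL

[Finance.WF:wf_Nightly_Load.ST:s_Load_Orders]
$InputFile1=orders.txt
$$Region=NE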

    For example, a session in the production folder, s_MonthlyCalculations, uses a string mapping parameter, $$State, that needs to be set to MA, and a datetime mapping variable, $$Time. $$Time already has an initial value of 9/30/2000 00:00:00 saved in the repository, but this value needs to be overridden to 10/1/2000 00:00:00. The session also uses session parameters to connect to source files and target databases, as well as to write session log to the appropriate session log file. The following table shows the parameters and variables that can be defined in the parameter file:

Parameter and Variable Type | Parameter and Variable Name | Desired Definition
String Mapping Parameter | $$State | MA
Datetime Mapping Variable | $$Time | 10/1/2000 00:00:00
Source File (Session Parameter) | $InputFile1 | Sales.txt
Database Connection (Session Parameter) | $DBConnection_Target | Sales (database connection)
Session Log File (Session Parameter) | $PMSessionLogFile | d:/session logs/firstrun.txt

The parameter file for the session includes the folder and session name, as well as each parameter and variable:

[Production.s_MonthlyCalculations]
$$State=MA
$$Time=10/1/2000 00:00:00
$InputFile1=sales.txt
$DBConnection_target=sales
$PMSessionLogFile=D:/session logs/firstrun.txt

The next time the session runs, edit the parameter file to change the state to MD and delete the $$Time variable. This allows the Integration Service to use the value for the variable that was set in the previous session run.

Mapping Variables

Declare mapping variables in PowerCenter Designer using the menu option Mappings -> Parameters and Variables (see the first figure, below). After selecting mapping variables, use the pop-up window to create a variable by specifying its name, data type, initial value, aggregation type, precision, and scale. This is similar to creating a port in most transformations (see the second figure, below).

Variables, by definition, are objects that can change value dynamically. PowerCenter has four functions for changing mapping variables:

SetVariable
SetMaxVariable
SetMinVariable
SetCountVariable

    A mapping variable can store the last value from a session run in the repository to be used as the starting value for the next session run.

Name. The name of the variable should be descriptive and be preceded by $$ (so that it is easily identifiable as a variable). A typical variable name is: $$Procedure_Start_Date.
Aggregation type. This entry creates specific functionality for the variable and determines how it stores data. For example, with an aggregation type of Max, the value stored in the repository at the end of each session run would be the maximum value across ALL records until the value is deleted.
Initial value. This value is used during the first session run when there is no corresponding and overriding parameter file. This value is also used if the stored repository value is deleted. If no initial value is identified, then a data-type specific default value is used.

    Variable values are not stored in the repository when the session:

Fails to complete.
Is configured for a test load.
Is a debug session.
Runs in debug mode and is configured to discard session output.

    Order of Evaluation

The start value is the value of the variable at the start of the session. The start value can be a value defined in the parameter file for the variable, a value saved in the repository from the previous run of the session, a user-defined initial value for the variable, or the default value based on the variable data type. The Integration Service looks for the start value in the following order:

1. Value in session parameter file
2. Value saved in the repository
3. Initial value
4. Default value

    Mapping Parameters and Variables

    Since parameter values do not change over the course of the session run, the value used is based on:

Value in session parameter file
Initial value
Default value

    Once defined, mapping parameters and variables can be used in the Expression Editor section of the following transformations:

Expression
Filter
Router
Update Strategy
Aggregator

    Mapping parameters and variables also can be used within the Source Qualifier in the SQL query, user-defined join, and source filter sections, as well as in a SQL override in the lookup transformation.

Guidelines for Creating Parameter Files

Use the following guidelines when creating parameter files:

Enter folder names for non-unique session names. When a session name exists more than once in a repository, enter the folder name to indicate the location of the session.
Create one or more parameter files. Assign parameter files to workflows, worklets, and sessions individually. Specify the same parameter file for all of these tasks or create several parameter files.
If including parameter and variable information for more than one session in the file, create a new section for each session. The folder name is optional.

[folder_name.session_name]
parameter_name=value
variable_name=value
mapplet_name.parameter_name=value

[folder2_name.session_name]
parameter_name=value
variable_name=value
mapplet_name.parameter_name=value

Specify headings in any order. Place headings in any order in the parameter file. However, if defining the same parameter or variable more than once in the file, the Integration Service assigns the parameter or variable value using the first instance of the parameter or variable.
Specify parameters and variables in any order. Below each heading, the parameters and variables can be specified in any order.
When defining parameter values, do not use unnecessary line breaks or spaces. The Integration Service may interpret additional spaces as part of the value.
List all necessary mapping parameters and variables. Values entered for mapping parameters and variables become the start value for parameters and variables in a mapping. Mapping parameter and variable names are not case sensitive.
List all session parameters. Session parameters do not have default values. An undefined session parameter can cause the session to fail. Session parameter names are not case sensitive.
Use correct date formats for datetime values. When entering datetime values, use the following date formats:

    MM/DD/RR

    MM/DD/RR HH24:MI:SS

    MM/DD/YYYY

    MM/DD/YYYY HH24:MI:SS

Do not enclose parameters or variables in quotes. The Integration Service interprets everything after the equal sign as part of the value.
Do enclose parameters in single quotes. In a Source Qualifier SQL Override, use single quotes if the parameter represents a string or date/time value to be used in the SQL Override.
Precede parameters and variables created in mapplets with the mapplet name as follows:

    mapplet_name.parameter_name=value

    mapplet2_name.variable_name=value

    Sample: Parameter Files and Session Parameters

    Parameter files, along with session parameters, allow you to change certain values between sessions. A commonly-used feature is the ability to create user-defined database connection session parameters to reuse sessions for different relational sources or targets. Use session parameters in the session properties, and then define the parameters in a parameter file. To do this, name all database connection session parameters with the prefix $DBConnection, followed by any alphanumeric and underscore characters. Session parameters and parameter files help reduce the overhead of creating multiple mappings when only certain attributes of a mapping need to be changed.
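As an illustrative sketch (folder, session, and connection names are hypothetical), the same session could be pointed at different relational databases simply by switching parameter files:

parmfile_dev.txt:

[Sales.s_Load_Customers]
$DBConnection_Source=DEV_CRM
$DBConnection_Target=DEV_DW

parmfile_prod.txt:

[Sales.s_Load_Customers]
$DBConnection_Source=PROD_CRM
$DBConnection_Target=PROD_DW

In the session properties, the source and target connections would then be set to $DBConnection_Source and $DBConnection_Target rather than to named connection objects.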

Using Parameters in Source Qualifiers

Another commonly used feature is the ability to create parameters in the source qualifiers, which allows the same mapping to be reused across different sessions to extract the data specified in the parameter file each session references. Moreover, there may be a time when it is necessary to create one mapping that creates a parameter file and a second mapping that uses the parameter file created by the first. The second mapping pulls the data using a parameter in the Source Qualifier transformation, which reads the parameter from the parameter file created in the first mapping. In the first mapping, the idea is to build a flat file target that serves as a parameter file for another session to use.
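For illustration only (the ORDERS table, its columns, and the $$Region_Code parameter are hypothetical), a Source Qualifier SQL override driven by the parameter file might look like:

SELECT ORDER_ID, CUSTOMER_ID, DATE_ENTERED
FROM ORDERS
WHERE REGION_CODE = '$$Region_Code'

With $$Region_Code defined under the session heading in the parameter file (for example, $$Region_Code=NE), each run extracts only the requested slice of data without any change to the mapping.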

    Sample: Variables and Parameters in an Incremental Strategy

    Variables and parameters can enhance incremental strategies. The following example uses a mapping variable, an expression transformation object, and a parameter file for restarting.

    Scenario

Company X wants to start with an initial load of all data, but wants subsequent process runs to select only new information. The source data has an inherent post date, stored in a column named Date_Entered, that can be used. The process will run once every twenty-four hours.

    Sample Solution

    Create a mapping with source and target objects. From the menu create a new mapping variable named $$Post_Date with the following attributes:

TYPE Variable
DATATYPE Date/Time
AGGREGATION TYPE MAX
INITIAL VALUE 01/01/1900

Note that there is no need to encapsulate the INITIAL VALUE with quotation marks. However, if this value is used within the Source Qualifier SQL, it may be necessary to use native RDBMS functions to convert it (e.g., TO_DATE(--,--)). Within the Source Qualifier transformation, use the following in the Source Filter attribute (this sample assumes Oracle as the source RDBMS):

DATE_ENTERED > TO_DATE('$$Post_Date','MM/DD/YYYY HH24:MI:SS')

Also note that the initial value 01/01/1900 will be expanded by the Integration Service to 01/01/1900 00:00:00, hence the need to convert the parameter to a datetime.

    The next step is to forward $$Post_Date and Date_Entered to an Expression transformation. This is where the function for setting the variable will reside. An output port named Post_Date is created with a data type of date/time. In the expression code section, place the following function:

SETMAXVARIABLE($$Post_Date, DATE_ENTERED)

The function evaluates each value for DATE_ENTERED and updates the variable with the Max value to be passed forward. For example:

DATE_ENTERED | Resultant POST_DATE
9/1/2000 | 9/1/2000
10/30/2001 | 10/30/2001
9/2/2000 | 10/30/2001

    Consider the following with regard to the functionality:

    1. In order for the function to assign a value, and ultimately store it in the repository, the port must be connected to a downstream object. It need not go to the target, but it must go to another Expression Transformation. The reason is that the memory will not be instantiated unless it is used in a downstream transformation object.

    2. In order for the function to work correctly, the rows have to be marked for insert. If the mapping is an update-only mapping (i.e., Treat Rows As is set to Update in the session properties) the function will not work. In this case, make the session Data Driven and add an Update Strategy after the transformation containing the SETMAXVARIABLE function, but before the Target.

    3. If the intent is to store the original Date_Entered per row and not the evaluated date value, then add an ORDER BY clause to the Source Qualifier. This way, the dates are processed and set in order and data is preserved.
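Following point 3 above, a sketch of the effective SQL for the first run (the ORDERS table name and column list are illustrative; the filter and ORDER BY come from the Source Qualifier properties described earlier) would be roughly:

SELECT ORDER_ID, CUSTOMER_ID, DATE_ENTERED
FROM ORDERS
WHERE DATE_ENTERED > TO_DATE('01/01/1900 00:00:00','MM/DD/YYYY HH24:MI:SS')
ORDER BY DATE_ENTERED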

    The first time this mapping is run, the SQL will select from the source where Date_Entered is > 01/01/1900 providing an initial load. As data flows through the mapping, the variable gets updated to the Max Date_Entered it encounters. Upon successful completion of the session, the variable is updated in the repository for use in the next session run. To view the current value for a particular variable associated with the session, right-click on the session in the Workflow Monitor and choose View Persistent Values.

    The following graphic shows that after the initial run, the Max Date_Entered was 02/03/1998. The next time this session is run, based on the variable in the Source Qualifier Filter, only sources where Date_Entered > 02/03/1998 will be processed.
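To make the restart behavior concrete, on that second run the Order of Evaluation resolves $$Post_Date to the persisted value, so the expanded source filter (continuing the Oracle example, and noting that the stored value may carry a time component) would be roughly:

DATE_ENTERED > TO_DATE('02/03/1998 00:00:00','MM/DD/YYYY HH24:MI:SS')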

    Resetting or Overriding Persistent Values

To reset the persistent value to the initial value declared in the mapping, view the persistent value from Workflow Manager (see graphic above) and press Delete Values. This deletes the stored value from the repository, causing the Order of Evaluation to use the Initial Value declared in the mapping.

    If a session run is needed for a specific date, use a parameter file. There are two basic ways to accomplish this:

Create a generic parameter file, place it on the server, and point all sessions to that parameter file. A session may (or may not) have a variable, and the parameter file need not have variables and parameters defined for every session using the parameter file. To override the variable, either change, uncomment, or delete the variable in the parameter file.
Run pmcmd for that session, but declare the specific parameter file within the pmcmd command.

    Configuring the Parameter File Location

    Specify the parameter filename and directory in the workflow or session properties. To enter a parameter file in the workflow or session properties:

Select either the Workflow or Session, choose Edit, and click the Properties tab.
Enter the parameter directory and name in the Parameter Filename field.
Enter either a direct path or a server variable directory. Use the appropriate delimiter for the Integration Service operating system.

    The following graphic shows the parameter filename and location specified in the session task.

    The next graphic shows the parameter filename and location specified in the Workflow.

In this example, after the initial session is run, the parameter file contents may look like:

[Test.s_Incremental]
;$$Post_Date=

    By using the semicolon, the variable override is ignored and the Initial Value or Stored Value is used. If, in the subsequent run, the data processing date needs to be set to a specific date (for example: 04/21/2001), then a simple Perl script or manual change can update the parameter file to:

[Test.s_Incremental]
$$Post_Date=04/21/2001

    Upon running the sessions, the order of evaluation looks to the parameter file first, sees a valid variable and value and uses that value for the session run. After successful completion, run another script to reset the parameter file.
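A minimal sketch of such a reset script, assuming a UNIX shell and that the session points at /infa/parmfiles/s_Incremental.parm (path and file name are hypothetical):

# Post-run reset: comment the override out again so the next run
# falls back to the value persisted in the repository.
cat > /infa/parmfiles/s_Incremental.parm <<'EOF'
[Test.s_Incremental]
;$$Post_Date=
EOF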

    Sample: Using Session and Mapping Parameters in Multiple Database Environments

    Reusable mappings that can source a common table definition across multiple databases, regardless of differing environmental definitions (e.g., instances, schemas, user/logins), are required in a multiple database environment.

    Scenario

    Company X maintains five Oracle database instances. All instances have a common table definition for sales orders, but each instance has a unique instance name, schema, and login.

DB Instance | Schema | Table | User | Password
ORC1 | aardso | orders | Sam | max
ORC99 | environ | orders | Help | me
HALC | hitme | order_done | Hi | Lois
UGLY | snakepit | orders | Punch | Judy
GORF | gmer | orders | Brer | Rabbit

Each sales order table has a different name, but the same definition:

ORDER_ID NUMBER (28) NOT NULL,
DATE_ENTERED DATE NOT NULL,
DATE_PROMISED DATE NOT NULL,
DATE_SHIPPED DATE NOT NULL,
EMPLOYEE_ID NUMBER (28) NOT NULL,
CUSTOMER_ID NUMBER (28) NOT NULL,
SALES_TAX_RATE NUMBER (5,4) NOT NULL,
STORE_ID NUMBER (28) NOT NULL

    Sample Solution

    Using Workflow Manager, create multiple relational connections. In this example, the strings are named according to the DB Instance name. Using Designer, create the mapping that sources the commonly defined table. Then create a Mapping Parameter named $$Source_Schema_Table with the following attributes:

    Note that the parameter attributes vary based on the specific environment. Also, the initial value is not required since this solution uses parameter files.

    Open the Source Qualifier and use the mapping parameter in the SQL Override as shown in the following graphic.

Open the Expression Editor and select Generate SQL. The generated SQL statement shows the columns. Override the table names in the SQL statement with the mapping parameter.

Using Workflow Manager, create a session based on this mapping. Within the Source Database connection drop-down box, choose the following parameter: $DBConnection_Source. Point the target to the corresponding target connection and finish.

    Now create the parameter files. In this example, there are five separate parameter files.

    Parmfile1.txt

[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=aardso.orders
$DBConnection_Source=ORC1

    Parmfile2.txt

[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=environ.orders
$DBConnection_Source=ORC99

    Parmfile3.txt

[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=hitme.order_done
$DBConnection_Source=HALC

    Parmfile4.txt

[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=snakepit.orders
$DBConnection_Source=UGLY

    Parmfile5.txt

[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=gmer.orders
$DBConnection_Source=GORF

    Use pmcmd to run the five sessions in parallel. The syntax for pmcmd for starting sessions with a particular parameter file is as follows:

    pmcmd startworkflow -s serveraddress:portno -u Username -p Password -paramfile parmfilename s_Incremental

    You may also use "-pv pwdvariable" if the named environment variable contains the encrypted form of the actual password.

    Notes on Using Parameter Files with Startworkflow

    When starting a workflow, you can optionally enter the directory and name of a parameter file. The PowerCenter Integration Service runs the workflow using the parameters in the file specified. For UNIX shell users, enclose the parameter file name in single quotes:

-paramfile '$PMRootDir/myfile.txt'

For Windows command prompt users, the parameter file name cannot have beginning or trailing spaces. If the name includes spaces, enclose the file name in double quotes:

-paramfile "$PMRootDir\my file.txt"

    Note: When writing a pmcmd command that includes a parameter file located on another machine, use the backslash (\) with the dollar sign ($). This ensures that the machine where the variable is defined expands the server variable.

    pmcmd startworkflow -uv USERNAME -pv PASSWORD -s SALES:6258 -f east -w wSalesAvg -paramfile '\$PMRootDir/myfile.txt'

    In the event that it is necessary to run the same workflow with different parameter files, use the following five separate commands:

pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile1.txt 1 1
pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile2.txt 1 1
pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile3.txt 1 1
pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile4.txt 1 1
pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile5.txt 1 1

    Alternatively, run the sessions in sequence with one parameter file. In this case, a pre- or post-session script can change the parameter file for the next session.
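One possible post-session command for such a sequence, assuming UNIX and hypothetical paths, simply copies the next run's file over the file named in the session properties:

cp /infa/parmfiles/Parmfile2.txt /infa/parmfiles/current.parm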

Dynamically Creating Parameter Files with a Mapping

Using advanced techniques, a PowerCenter mapping can be built that produces, as its target file, a parameter file (.parm) that can be referenced by other mappings and sessions. When many mappings use the same parameter file, it is desirable to be able to easily re-create the file when mapping parameters are changed or updated. This can also be beneficial when parameters change from run to run. There are a few different methods of creating a parameter file with a mapping.

There is a mapping template example on my.informatica.com that illustrates a method of using a PowerCenter mapping to source from a process table containing mapping parameters and to create a parameter file. The same thing can also be accomplished by sourcing a flat file in parameter file format with code characters in the fields to be altered.

[folder_name.session_name]
parameter_name=
variable_name=value
mapplet_name.parameter_name=value

[folder2_name.session_name]
parameter_name=
variable_name=value
mapplet_name.parameter_name=value

In place of the empty value, one could place text such as filename_<DATE>.dat. The mapping would then perform a string replace wherever the <DATE> token occurred, and the output might look like:

Src_File_Name=filename_20080622.dat
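A minimal sketch of that string replace, assuming the template line arrives in a port named TEMPLATE_LINE and the token placed in the file is the literal text <DATE> (both are illustrative), is an Expression transformation output port defined as:

REPLACESTR(1, TEMPLATE_LINE, '<DATE>', TO_CHAR(SESSSTARTTIME, 'YYYYMMDD'))

This would turn a template line such as Src_File_Name=filename_<DATE>.dat into the output shown above.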

    This method works well when values change often and parameter groupings utilize different parameter sets. The overall benefits of using this method are such that if many mappings use the same parameter file, changes can be made by updating the source table and recreating the file. Using this process is faster than manually updating the file line by line.

    Final Tips for Parameters and Parameter Files

    Use a single parameter file to group parameter information for related sessions.

When sessions are likely to use the same database connection or directory, you might want to include them in the same parameter file. When connections or directories change, you can update information for all sessions by editing one parameter file. Sometimes you reuse session parameters in a cycle. For example, you might run a session against a sales database every day, but run the same session against sales and marketing databases once a week. You can create separate parameter files for each session run. Instead of changing the parameter file in the session properties each time you run the weekly session, use pmcmd to specify the parameter file to use when you start the session.

    Use reject file and session log parameters in conjunction with target file or target database connection parameters.

    When you use a target file or target database connection parameter with a session, you can keep track of reject files by using a reject file parameter. You can also use the session log parameter to write the session log to the target machine.
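As a hedged illustration (folder, session, and file paths are hypothetical; user-defined reject file parameters conventionally use the $BadFile prefix), the related entries might be grouped like this:

[Production.s_Load_Sales]
$DBConnection_Target=SALES_DW
$BadFile1=/infa/badfiles/s_Load_Sales_salesdw.bad
$PMSessionLogFile=/infa/logs/s_Load_Sales_salesdw.log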

    Use a resource to verify the session runs on a node that has access to the parameter file.

    In the Administration Console, you can define a file resource for each node that has access to the parameter file and configure the Integration Service to check resources. Then, edit the session that uses the parameter file and assign the resource. When you run the workflow, the Integration Service runs the session with the required resource on a node that has the resource available.

    Save all parameter files in one of the process variable directories.

    If you keep all parameter files in one of the process variable directories, such as $SourceFileDir, use the process variable in the session property sheet. If you need to move the source and parameter files at a later date, you can update all sessions by changing the process variable to point to the new directory.
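For instance, the Parameter Filename session property could be set to something like $SourceFileDir/parmfiles/nightly.parm (the sub-path is hypothetical); relocating the files later then only requires updating the process variable rather than editing each session.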


Session and Data Partitioning

Challenge

Improving performance by identifying strategies for partitioning relational tables, XML, COBOL and standard flat files, and by coordinating the interaction between sessions, partitions, and CPUs. These strategies take advantage of the enhanced partitioning capabilities in PowerCenter.

Description

On hardware systems that are under-utilized, you may be able to improve performance by processing partitioned data sets in parallel in multiple threads of the same session instance running on the PowerCenter Server engine. However, parallel execution may impair performance on over-utilized systems or systems with smaller I/O capacity.

    In addition to hardware, consider these other factors when determining if a session is an ideal candidate for partitioning: source and target database setup, target type, mapping design, and certain assumptions that are explained in the following paragraphs. Use the Workflow Manager client tool to implement session partitioning.

    Assumptions

    The following assumptions pertain to the source and target systems of a session that is a candidate for partitioning. These factors can help to maximize the benefits that can be achieved through partitioning.

Indexing has been implemented on the partition key when using a relational source.
Source files are located on the same physical machine as the PowerCenter Server process when partitioning flat files, COBOL, and XML, to reduce network overhead and delay.
All possible constraints are dropped or disabled on relational targets.
All possible indexes are dropped or disabled on relational targets.
Table spaces and database partitions are properly managed on the target system.
Target files are written to the same physical machine that hosts the PowerCenter process in order to reduce network overhead and delay.
Oracle External Loaders are utilized whenever possible.

    First, determine if you should partition your session. Parallel execution benefits systems that have the following characteristics:

    Check idle time and busy percentage for each thread. This gives the high-level information of the bottleneck point/points. In order to do this, open the session log and look for messages starting with PETL_ under the RUN INFO FOR TGT LOAD ORDER GROUP section. These PETL messages give the following details against the reader, transformation, and writer threads:

Total Run Time
Total Idle Time
Busy Percentage

Under-utilized or intermittently-used CPUs. To determine if this is the case, check the CPU usage of your machine. The ID column displays the percentage of CPU time spent idling during the specified interval without any I/O wait. If there are CPU cycles available (i.e., twenty percent or more idle time), then this session's performance may be improved by adding a partition.

Windows 2000/2003 - check the Task Manager Performance tab.
UNIX - type vmstat 1 10 on the command line.

    Sufficient I/O. To determine the I/O statistics:

Windows 2000/2003 - check the Task Manager Performance tab.
UNIX - type iostat on the command line. The column %IOWAIT displays the percentage of CPU time spent idling while waiting for I/O requests. The column %idle displays the total percentage of the time that the CPU spends idling (i.e., the unused capacity of the CPU).

    Sufficient memory. If too much memory is allocated to your session, you will receive a memory allocation error. Check to see that you're using as much memory as you can. If the session is paging, increase the memory. To determine if the session is paging:

Windows 2000/2003 - check the Task Manager Performance tab.
UNIX - type vmstat 1 10 on the command line. PI displays the number of pages swapped in from the page space during the specified interval. PO displays the number of pages swapped out to the page space during the specified interval. If these values indicate that paging is occurring, it may be necessary to allocate more memory, if possible.

    If you determine that partitioning is practical, you can begin setting up the partition.

    Partition Types

    PowerCenter provides increased control of the pipeline threads. Session performance can be improved by adding partitions at various pipeline partition points. When you configure the partitioning information for a pipeline, you must specify a partition type. The partition type determines how the PowerCenter Server redistributes data across partition points. The Workflow Manager allows you to specify the following partition types:

    Round-robin Partitioning

    The PowerCenter Server distributes data evenly among all partitions. Use round-robin partitioning when you need to distribute rows evenly and do not need to group data among partitions.

    In a pipeline that reads data from file sources of different sizes, use round-robin partitioning. For example, consider a session based on a mapping that reads data from three flat files of different sizes.

Source file 1: 100,000 rows
Source file 2: 5,000 rows
Source file 3: 20,000 rows

    In this scenario, the recommended best practice is to set a partition point after the Source Qualifier and set the partition type to round-robin. The PowerCenter Server distributes the data so that each partition processes approximately one third of the data.

    Hash Partitioning

    The PowerCenter Server applies a hash function to a partition key to group data among partitions.

Use hash partitioning when you want to ensure that the PowerCenter Server processes groups of rows with the same partition key in the same partition. For example, consider a scenario where you need to sort items by item ID, but do not know the number of items that have a particular ID number. If you select hash auto-keys, the PowerCenter Server uses all grouped or sorted ports as the partition key. If you select hash user keys, you specify a number of ports to form the partition key.

    An example of this type of partitioning is when you are using Aggregators and need to ensure that groups of data based on a primary key are processed in the same partition.

    Key Range Partitioning

    With this type of partitioning, you specify one or more ports to form a compound partition key for a source or target. The PowerCenter Server then passes data to each partition depending on the ranges you specify for each port.

    Use key range partitioning where the sources or targets in the pipeline are partitioned by key range. Refer to Workflow Administration Guide for further directions on setting up Key range partitions.

For example, with key range partitioning set at End range = 2020, the PowerCenter Server passes in data where values are less than 2020. Similarly, for Start range = 2020, the PowerCenter Server passes in data where values are equal to or greater than 2020. Null values or values that may not fall in either partition are passed through the first partition.

    Pass-through Partitioning

    In this type of partitioning, the PowerCenter Server passes all rows at one partition point to the next partition point without redistributing them.

    Use pass-through partitioning where you want to create an additional pipeline stage to improve performance, but do not want to (or cannot) change the distribution of data across partitions. The Data Transformation Manager spawns a master thread on each session run, which in turn creates three threads (reader, transformation, and writer threads) by default. Each of these threads can, at the most, process one data set at a time and hence, three data sets simultaneously. If there are complex transformations in the mapping, the transformation thread may take a longer time than the other threads, which can slow data throughput.

    It is advisable to define partition points at these transformations. This creates another pipeline stage and reduces the overhead of a single transformation thread.

    When you have considered all of these factors and selected a partitioning strategy, you can begin the iterative process of adding partitions. Continue adding partitions to the session until you meet the desired performance threshold or observe degradation in performance.

Tips for Efficient Session and Data Partitioning

Add one partition at a time. To best monitor performance, add one partition at a time, and note your session settings before adding additional partitions. Refer to the Workflow Administration Guide for more information on restrictions on the number of partitions.
Set DTM buffer memory. For a session with n partitions, set this value to at least n times the original value for the non-partitioned session.
Set cached values for the Sequence Generator. For a session with n partitions, there is generally no need to use the Number of Cached Values property of the Sequence Generator. If you must set this value to a value greater than zero, make sure it is at least n times the original value for the non-partitioned session.
Partition the source data evenly. The source data should be partitioned into equal-sized chunks for each partition.
Partition tables. A notable increase in performance can also be realized when the actual source and target tables are partitioned. Work with the DBA to discuss the partitioning of source and target tables, and the setup of tablespaces.
Consider using an external loader. As with any session, using an external loader may increase session performance. You can only use Oracle external loaders for partitioning. Refer to the Session and Server Guide for more information on using and setting up the Oracle external loader for partitioning.
Write throughput. Check the session statistics to see if you have increased the write throughput.
Paging. Check to see if the session is now causing the system to page. When you partition a session and there are cached lookups, you must make sure that DTM memory is increased to handle the lookup caches. When you partition a source that uses a static lookup cache, the PowerCenter Server creates one memory cache for each partition and one disk cache for each transformation. Thus, memory requirements grow for each partition. If the memory is not bumped up, the system may start paging to disk, causing degradation in performance.
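For instance, assuming a non-partitioned session that ran with a 12 MB DTM buffer and a single 20 MB static lookup cache, a three-partition version of the same session would call for at least 3 x 12 MB = 36 MB of DTM buffer and roughly 3 x 20 MB = 60 MB of lookup cache memory; the exact figures depend on the session and are purely illustrative.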

When you finish partitioning, monitor the session to see if the partition is degrading or improving session performance. If the session performance is improved and the session meets your requirements, add another partition.

Session on Grid and Partitioning Across Nodes

Session on Grid provides the ability to run a session on a multi-node Integration Service. This is most suitable for large sessions. For small and medium-size sessions, it is more practical to distribute whole sessions to different nodes using Workflow on Grid. Session on Grid leverages existing partitions of a session by executing threads in multiple DTMs. The Log service can be used to get the cumulative log. See PowerCenter Enterprise Grid Option for detailed configuration information.

Dynamic Partitioning

Dynamic partitioning is also called parameterized partitioning because a single parameter can determine the number of partitions. With the Session on Grid option, more partitions can be added when more resources are available. The number of partitions in a session can also be tied to the partitions in the database, so that PowerCenter partitioning is easier to maintain and leverages database partitioning.


Real-Time Integration with PowerCenter

Challenge

Configure PowerCenter to work with various PowerExchange data access products to process real-time data. This Best Practice discusses guidelines for establishing a connection with PowerCenter and setting up a real-time session to work with PowerCenter.

Description

PowerCenter with the real-time option can be used to process data from real-time data sources. PowerCenter supports the following types of real-time data:

Messages and message queues. PowerCenter with the real-time option can be used to integrate third-party messaging applications using a specific PowerExchange data access product. Each PowerExchange product supports a specific industry-standard messaging application, such as WebSphere MQ, JMS, MSMQ, SAP NetWeaver, TIBCO, and webMethods. You can read from messages and message queues and write to messages, messaging applications, and message queues. WebSphere MQ uses a queue to store and exchange data. Other applications, such as TIBCO and JMS, use a publish/subscribe model. In this case, the message exchange is identified using a topic.
Web service messages. PowerCenter can receive a web service message from a web service client through the Web Services Hub, transform the data, and load the data to a target or send a message back to a web service client. A web service message is a SOAP request from a web service client or a SOAP response from the Web Services Hub. The Integration Service processes real-time data from a web service client by receiving a message request through the Web Services Hub and processing the request. The Integration Service can send a reply back to the web service client through the Web Services Hub or write the data to a target.
Changed source data. PowerCenter can extract changed data in real time from a source table using the PowerExchange Listener and write data to a target. Real-time sources supported by PowerExchange are ADABAS, DATACOM, DB2/390, DB2/400, DB2/UDB, IDMS, IMS, MS SQL Server, Oracle, and VSAM.

Connection Setup

PowerCenter uses some attribute values in order to correctly connect to and identify the third-party messaging application and the message itself. Each PowerExchange product supplies its own connection attributes that need to be configured properly before running a real-time session.

    Setting Up Real-Time Session in PowerCenter

    The PowerCenter real-time option uses a zero latency engine to process data from the messaging system. Depending on the messaging systems and the application that sends and receives messages, there may be a period when there are many messages and, conversely, there may be a period when there are no messages. PowerCenter uses the attribute Flush Latency to determine how often the messages are being flushed to the target. PowerCenter also provides various attributes to control when the session ends.

    The following reader attributes determine when a PowerCenter session should end:

Message Count - Controls the number of messages the PowerCenter Server reads from the source before the session stops reading from the source.
Idle Time - Indicates how long the PowerCenter Server waits when no messages arrive before it stops reading from the source.
Time Slice Mode - Indicates a specific range of time during which the server reads messages from the source. Only PowerExchange for WebSphere MQ uses this option.
Reader Time Limit - Indicates the number of seconds the PowerCenter Server spends reading messages from the source.

The specific filter conditions and options available to you depend on which real-time source is being used. For example, the following are attributes for PowerExchange for DB2 for i5/OS:

    Set the attributes that control how the reader ends. One or more attributes can be used to control the end of session.

For example, set the Reader Time Limit attribute to 3600. The reader will end after 3600 seconds. If the idle time limit is set to 500 seconds, the reader will end if it doesn't process any changes for 500 seconds (i.e., it remains idle for 500 seconds). If more than one attribute is selected, the first attribute that satisfies its condition is used to control the end of the session.

Note: The real-time attributes can be found in the Reader Properties for PowerExchange for JMS, TIBCO, webMethods, and SAP iDoc. For PowerExchange for WebSphere MQ, the real-time attributes must be specified as a filter condition.

The next step is to set the Real-time Flush Latency attribute. The Flush Latency defines how often PowerCenter should flush messages, expressed in milliseconds.

    For example, if the Real-time Flush Latency is set to 2000, PowerCenter flushes messages every two seconds. The messages will also be flushed from the reader buffer if the Source Based Commit condition is reached. The Source Based Commit condition is defined in the Properties tab of the session.

    The message recovery option can be enabled to ensure that no messages are lost if a session fails as a result of unpredictable error, such as power loss. This is especially important for real-time sessions because some messaging applications do not store the messages after the messages are consumed by another application.

    A unit of work (UOW) is a collection of changes within a single commit scope made by a transaction on the source system from an external application. Each UOW may consist of a different number of rows depending on the transaction to the source system. When you use the UOW Count Session condition, the Integration Service commits source data to the target when it reaches the number of UOWs specified in the session condition.

    For example, if the value for UOW Count is 10, the Integration Service commits all data read from the source after the 10th UOW enters the source. The lower you set the value, the faster the Integration Service commits data to the target. The lower value also causes the system to consume more resources.

    Executing a Real-Time Session

    A real-time session often has to be up and running continuously to listen to the messaging application and to process messages immediately after the messages arrive. Set the reader attribute Idle Time to -1 and Flush Latency to a specific time interval. This is applicable for all PowerExchange products except for PowerExchange for WebSphere MQ where the session continues to run and flush the messages to the target using the specific flush latency interval.

    Another scenario is the ability to read data from another source system and immediately send it to a real-time target. For example, reading data from a relational source and writing it to WebSphere MQ. In this case, set the session to run continuously so that every change in the source system can be immediately reflected in the target.

    A real-time session may run continuously until a condition is met to end the session. In some situations it may be required to periodically stop the session and restart it. This is sometimes necessary to execute a post-session command or run some other process that is not part of the session. To stop the session and restart it, it is useful to deploy continuously running workflows. The Integration Service starts the next run of a continuous workflow as soon as it completes the first.

    To set a workflow to run continuously, edit the workflow and select the Scheduler tab. Edit the Scheduler and select Run Continuously from Run Options. A continuous workflow starts automatically when the Integration Service initializes. When the workflow stops, it restarts immediately.

    Real-Time Sessions and Active Transformations

Some of the transformations in PowerCenter are active transformations, which means that the number of input rows and the number of output rows of the transformation are not the same. In most cases, an active transformation requires all of the input rows to be processed before output rows are passed to the next transformation or target. For a real-time session, the flush latency will be ignored if the DTM needs to wait for all the rows to be processed.

Depending on user needs, active transformations such as Aggregator, Rank, and Sorter can be used in a real-time session by setting the Transaction Scope property in the active transformation to Transaction. This signals the session to process the data in the transformation once per transaction. For example, if a real-time session uses an Aggregator that sums a field of the input, the summation is done per transaction, as opposed to across all rows. The result may or may not be correct depending on the requirement. Use an active transformation with a real-time session if you want to process the data per transaction.

    Custom transformations can also be defined to handle data per transaction so that they can be used in a real-time session.

PowerExchange Real-Time Connections

PowerExchange NRDB CDC Real Time connections can be used to extract changes from ADABAS, DATACOM, IDMS, IMS, and VSAM sources in real time.

    The DB2/390 connection can be used to extract changes for DB2 on OS/390 and the DB2/400 connection to extract from AS/400. There is a separate connection to read from DB2 UDB in real time.

The NRDB CDC connection requires the application name and the restart token file name to be overridden for every session. When the PowerCenter session completes, the PowerCenter Server writes the last restart token to a physical file called the RestartToken File. The next time the session starts, the PowerCenter Server reads the restart token from the file and then starts reading changes from the point where it last left off. Every PowerCenter session needs to have a unique restart token file name.

    Informatica recommends archiving the file periodically. The reader timeout or the idle timeout can be used to stop a real-time session. A post-session command can be used to archive the RestartToken file.

    The encryption mode for this connection can slow down the read performance and increase resource consumption. Compression mode can help in situations where the network is a bottleneck; using compression also increases the CPU and memory usage on the source system.

Archiving PowerExchange Tokens

When the PowerCenter session completes, the Integration Service writes the last restart token to a physical file called the RestartToken File. The token in the file indicates the end point where the read job ended. The next time the session starts, the PowerCenter Server reads the restart token from the file and then starts reading changes from the point where it left off. The token file is overwritten each time the session has to write a token out. PowerCenter does not implicitly maintain an archive of these tokens.

If, for some reason, the changes from a particular point in time have to be replayed, we need the PowerExchange token from that point in time.

    To enable such a process, it is a good practice to periodically copy the token file to a backup folder. This procedure is necessary to maintain an archive of the PowerExchange tokens. A real-time PowerExchange session may be stopped periodically, using either the reader time limit or the idle time limit. A post-session command is used to copy the restart token file to an archive folder. The session will be part of a continuous running workflow, so when the session completes after the post session command, it automatically restarts again. From a data processing standpoint very little changes; the process pauses for a moment, archives the token, and starts again.

    The following are examples of post-session commands that can be used to copy a restart token file (session.token) and append the current system date/time to the file name for archive purposes:

UNIX:

cp session.token session`date '+%m%d%H%M'`.token

    Windows:

    copy session.token session-%date:~4,2%-%date:~7,2%-%date:~10,4%-%time:~0,2%-%time:~3,2%.token

PowerExchange for WebSphere MQ

1. In the Workflow Manager, connect to a repository and choose Connection > Queue.
2. The Queue Connection Browser appears. Select New > Message Queue.
3. The Connection Object Definition dialog box appears.

You need to specify three attributes in the Connection Object Definition dialog box:

Name - the name for the connection. (Use _ to uniquely identify the connection.)
Queue Manager - the Queue Manager name for the message queue. (In Windows, the default Queue Manager name is QM_ followed by the machine name.)
Queue Name - the Message Queue name.

    To obtain the Queue Manager and Message Queue names:

Open the MQ Series Administration Console. The Queue Manager should appear on the left panel.
Expand the Queue Manager icon. A list of the queues for the queue manager appears on the left panel.

Note that the Queue Manager's name and Queue Name are case-sensitive.

    PowerExchange for JMS

PowerExchange for JMS can be used to read messages from or write messages to various JMS providers, such as WebSphere MQ JMS and BEA WebLogic Server.

    There are two types of JMS application connections:

JNDI Application Connection, which is used to connect to a JNDI server during a session run.
JMS Application Connection, which is used to connect to a JMS provider during a session run.

    JNDI Application Connection Attributes are:

Name
JNDI Context Factory
JNDI Provider URL
JNDI UserName
JNDI Password

    JMS Application Connection Attributes are:

Name
JMS Destination Type
JMS Connection Factory Name
JMS Destination
JMS UserName
JMS Password

    Configuring the JNDI Connection for WebSphere MQ

The JNDI settings for WebSphere MQ JMS can be configured using a file system service or LDAP (Lightweight Directory Access Protocol). The JNDI settings are stored in a file named JMSAdmin.config. The file should be installed in the WebSphere MQ Java installation/bin directory.

    If you are using a file system service provider to store your JNDI settings, remove the number sign (#) before the following context factory setting:

INITIAL_CONTEXT_FACTORY=com.sun.jndi.fscontext.RefFSContextFactory

Or, if you are using the LDAP service provider to store your JNDI settings, remove the number sign (#) before the following context factory setting:

INITIAL_CONTEXT_FACTORY=com.sun.jndi.ldap.LdapCtxFactory

Find the PROVIDER_URL settings.

    If you are using a file system service provider to store your JNDI settings, remove the number sign (#) before the following provider URL setting and provide a value for the JNDI directory.

PROVIDER_URL=file:/<JNDI directory>

where <JNDI directory> is the directory where you want JNDI to store the .binding file.

    Or, if you are using the LDAP service provider to store your JNDI settings, remove the number sign (#) before the provider URL setting and specify a hostname.

#PROVIDER_URL=ldap://<hostname>/context_name

    For example, you can specify:

PROVIDER_URL=ldap://<hostname>/o=infa,c=rc

    If you want to provide a user DN and password for connecting to JNDI, you can remove the # from the following settings and enter a user DN and password:

PROVIDER_USERDN=cn=myname,o=infa,c=rc
PROVIDER_PASSWORD=test

    The following table shows the JMSAdmin.config settings and the corresponding attributes in the JNDI application connection in the Workflow Manager:

JMSAdmin.config Setting | JNDI Application Connection Attribute
INITIAL_CONTEXT_FACTORY | JNDI Context Factory
PROVIDER_URL | JNDI Provider URL
PROVIDER_USERDN | JNDI UserName
PROVIDER_PASSWORD | JNDI Password

    Configuring the JMS Connection for WebSphere MQ

The JMS connection is defined using a tool in JMS called jmsadmin, which is available in the WebSphere MQ Java installation/bin directory. Use this tool to configure the JMS Connection Factory.

    The JMS Connection Factory can be a Queue Connection Factory or Topic Connection Factory.

When a Queue Connection Factory is used, define a JMS queue as the destination.
When a Topic Connection Factory is used, define a JMS topic as the destination.

The command to define a queue connection factory (qcf) is:

def qcf(<qcf_name>) qmgr(queue_manager_name) hostname(QM_machine_hostname) port(QM_machine_port)

The command to define a JMS queue is:

def q(<queue_name>) qmgr(queue_manager_name) qu(queue_manager_queue_name)

The command to define a JMS topic connection factory (tcf) is:

def tcf(<tcf_name>) qmgr(queue_manager_name) hostname(QM_machine_hostname) port(QM_machine_port)

The command to define the JMS topic is:

def t(<topic_name>) topic(pub/sub_topic_name)

The topic name must be unique. For example: topic(application/infa)

The following table shows the JMS object types and the corresponding attributes in the JMS application connection in the Workflow Manager:

JMS Object Type | JMS Application Connection Attribute
QueueConnectionFactory or TopicConnectionFactory | JMS Connection Factory Name
JMS Queue Name or JMS Topic Name | JMS Destination

    Configure the JNDI and JMS Connection for WebSphere

    Configure the JNDI settings for WebSphere to use WebSphere as a provider for JMS sources or targets in a PowerCenterRT session.

    JNDI Connection

    Add the following option to the file JMSAdmin.bat to configure JMS properly:

-Djava.ext.dirs=<WebSphere AppServer directory>\bin

For example: -Djava.ext.dirs=WebSphere\AppServer\bin

The JNDI connection resides in the JMSAdmin.config file, which is located in the MQ Series Java/bin directory.

INITIAL_CONTEXT_FACTORY=com.ibm.websphere.naming.WsnInitialContextFactory

PROVIDER_URL=iiop://<hostname>/

    For example:

    PROVIDER_URL=iiop://localhost/

PROVIDER_USERDN=cn=informatica,o=infa,c=rc
PROVIDER_PASSWORD=test

    JMS Connection

    The JMS configuration is similar to the JMS Connection for WebSphere MQ.

    Configure the JNDI and JMS Connection for BEA WebLogic

    Configure the JNDI settings for BEA WebLogic to use BEA WebLogic as a provider for JMS sources or targets in a PowerCenterRT session.

PowerCenter Connect for JMS and the WebLogic server hosting JMS do not need to be on the same machine; PowerCenter Connect for JMS only needs a URL that points to the correct WebLogic Server.

JNDI Connection

    The WebLogic Server automatically provides a context factory and URL during the JNDI set-up configuration for WebLogic Server. Enter these values to configure the JNDI connection for JMS sources and targets in the Workflow Manager.

    Enter the following value for JNDI Context Factory in the JNDI Application Connection in the Workflow Manager:

weblogic.jndi.WLInitialContextFactory

Enter the following value for JNDI Provider URL in the JNDI Application Connection in the Workflow Manager:

t3://<WebLogic Server hostname>:<port>

where <WebLogic Server hostname> is the hostname or IP address of the WebLogic Server and <port> is the port number for the WebLogic Server.
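For example, assuming the WebLogic Server runs on a host named wlhost (an illustrative name) and listens on the default WebLogic port 7001, the JNDI Provider URL would be:

t3://wlhost:7001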

    JMS Connection

    The JMS connection is configured from the BEA WebLogic Server console. Select JMS -> Connection Factory.

    The JMS Destination is also configured from the BEA WebLogic Server console.

From the Console pane, select Services > JMS > Servers > <JMS server name> > Destinations under your domain.

    Click Configure a New JMSQueue or Configure a New JMSTopic.

    The following table shows the JMS object types and the corresponding attributes in the JMS application connection in the Workflow Manager:

WebLogic Server JMS Object -> JMS Application Connection Attribute

Connection Factory Settings: JNDIName -> JMS Connection Factory Name
Destination Settings: JNDIName -> JMS Destination

In addition to the JNDI and JMS settings, BEA WebLogic also offers a feature called JMS Store, which can be used for persistent messaging when reading and writing JMS messages. The JMS Stores configuration is available from the Console pane: select Services > JMS > Stores under your domain.

    Configuring the JNDI and JMS Connection for TIBCO

TIBCO Rendezvous Server does not adhere to the JMS specifications. As a result, PowerCenter Connect for JMS cannot connect directly to the Rendezvous Server. TIBCO Enterprise Server, which is JMS-compliant, acts as a bridge between PowerCenter Connect for JMS and TIBCO Rendezvous Server. Configure a connection-bridge between TIBCO Rendezvous Server and TIBCO Enterprise Server so that PowerCenter Connect for JMS can read messages from and write messages to TIBCO Rendezvous Server.

    To create a connection-bridge between PowerCenter Connect for JMS and TIBCO Rendezvous Server, follow these steps:

1. Configure PowerCenter Connect for JMS to communicate with TIBCO Enterprise Server.
2. Configure TIBCO Enterprise Server to communicate with TIBCO Rendezvous Server.

    Configure the following information in your JNDI application connection:

JNDI Context Factory: com.tibco.tibjms.naming.TibjmsInitialContextFactory
Provider URL: tibjmsnaming://<host>:<port>, where <host> and <port> are the host name and port number of the Enterprise Server.

To make a connection-bridge between TIBCO Rendezvous Server and TIBCO Enterprise Server:

    1. In the file tibjmsd.conf, enable the tibrv transport configuration parameter as in the example below, so that TIBCO Enterprise Server can communicate with TIBCO Rendezvous messaging systems:

    tibrv_transports = enabled

2. Enter the following transports in the transports.conf file:

[RV]
type = tibrv                        // type of external messaging system
topic_import_dm = TIBJMS_RELIABLE   // only reliable/certified messages can transfer
daemon = tcp:localhost:7500         // default daemon for the Rendezvous server

    The transports in the transports.conf configuration file specify the communication protocol between TIBCO Enterprise for JMS and the TIBCO Rendezvous system. The import and export properties on a destination can list one or more transports to use to communicate with the TIBCO Rendezvous system.

3. Optionally, specify the name of one or more transports for reliable and certified message delivery in the export property in the file topics.conf.
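As an illustration, assuming a topic named sample.exported.topic (a hypothetical name), a topics.conf entry that exports the topic over the RV transport defined above might look like:

sample.exported.topic export=RV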

    The export property allows messages published to a topic by a JMS client to be exported to the external systems with configured transports. Currently, you can configure transports for TIBCO Rendezvous reliable and certified messaging protocols.

PowerExchange for webMethods

When importing webMethods sources into the Designer, be sure the webMethods host name does not contain a '.' character. You cannot use fully-qualified names for the connection when importing webMethods sources. You can use fully-qualified names for the connection when importing webMethods targets, because PowerCenter does not use the same grouping method for importing sources and targets. To work around this, modify the hosts file to resolve the short name to the IP address.

    For example:

    Host File:

    crpc23232.crp.informatica.com crpc23232

Use crpc23232 instead of crpc23232.crp.informatica.com as the host name when importing webMethods source definitions. This step is only required for importing PowerExchange for webMethods sources into the Designer.

If you are using the request/reply model in webMethods, PowerCenter needs to send an appropriate document back to the Broker for every document it receives. PowerCenter populates some of the envelope fields of the webMethods target so that the webMethods Broker can recognize that the published document is a reply from PowerCenter. The envelope fields destid and tag are populated for the request/reply model: destid should be populated from the pubid of the source document, and tag should be populated from the tag of the source document. Use the option Create Default Envelope Fields when importing webMethods sources and targets into the Designer to make the envelope fields available in PowerCenter.

    Configuring the PowerExchange for webMethods Connection

To create or edit the PowerExchange for webMethods connection, select Connections > Application > webMethods Broker in the Workflow Manager.

    PowerExchange for webMethods connection attributes are:

Name
Broker Host
Broker Name
Client ID
Client Group
Application Name
Automatic Reconnect
Preserve Client State

Enter the connection to the Broker Host in the following format: <host name>:<port>.
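For example, assuming a Broker Server running on a host named wmbroker01 (an illustrative name) and listening on the default Broker port 6849, the Broker Host value would be:

wmbroker01:6849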

If you are using the request/reply method in webMethods, you have to specify a client ID in the connection. Be sure that the client ID used in the request connection is the same as the client ID used in the reply connection. Note that if you are using multiple request/reply document pairs, you need to set up a different webMethods connection for each pair because they cannot share a client ID.

Master Data Management Architecture with Informatica


Challenge

Data integration is critical to managing the modern business environment as companies find themselves with multiple redundant systems that contain master data built on differing data models and data definitions. This creates a data governance challenge: orchestrating people, policies, procedures and technology to manage enterprise data availability, usability, integrity and security for business process efficiency and compliance.

Master data management addresses several major challenges in the modern business environment:

A need for a cross-enterprise perspective for better business intelligence.
A similar need for consistency across customer records for improved transaction management.
An ability to provide data governance at the enterprise level.
A requirement to coexist with existing information technology infrastructure.

Description

A logical view of the MDM Hub, the data flow through the Hub, and the physical architecture of the Hub are described in the following sections.

Logical View

A logical view of the MDM Hub is shown below:

The Hub supports access to data in batch, real-time, and/or asynchronous messaging form. Typically, this access is supported through a combination of data integration tools, such as Informatica PowerCenter, and embedded Hub functionality. To master the data in the Hub optimally, the source data needs to be analyzed. This analysis typically takes place using a data quality tool, such as Informatica Data Quality.

The goal of the Hub is to master data for one or more domains within a Customer's environment. The MDM Hub maintains a significant amount of metadata to support data mastering functionality, such as lineage, history, survivorship and the like. The MDM Hub data model is completely flexible: it can start from a Customer's existing model or an industry standard model, or a model may be created from scratch.

Once the data model has been defined, data needs to be cleansed and standardized. The MDM Hub provides an open architecture that allows a Customer to use any cleanse engine already in place, and it provides an optimized interface for Informatica Data Quality.

Data is then matched in the system using a combination of deterministic and fuzzy matching. Informatica Identity Resolution is the underlying match technology in the Hub; its interfaces have been optimized for Hub use and abstracted so that they are easily leveraged by business users.

    After matching has been performed, the Hub can consolidate records by linking them together to produce a registry of related records or by merging them to produce a Golden Record or a Best Version of the Truth (BVT). When a BVT is produced, survivorship rules defined in the MDM trust framework are applied such that the appropriate attributes from the contributing source records are promoted into the BVT.

The BVT provides a basis for identifying and managing relationships across entities and sources. By building on top of the BVT, the MDM Hub can expose cross-source or cross-entity relationships that are not visible within an individual source.

    A data governance framework is exposed to data stewards through the Informatica Data Director (IDD). IDD provides data governance task management functionality, rudimentary data governance workflows, and data steward views of the data. If more complex workflows are required, external workflow engines can be easily integrated into the Hub. Individual views of data from within the IDD can also be exposed directly into applications through Informatica Data Controls.

There is an underlying security framework within the MDM Hub that provides fine-grained control of access to data within the Hub. The framework supports configuring the security policies locally or consuming them from external sources, based on a customer's desired infrastructure.

    Data Flow

    A typical data flow through the Hub is shown below:

Implementations of the MDM Hub start by defining the data model into which all of the data will be consolidated. This target data model will contain the BVT and the associated metadata to support it. Source data is brought into the Hub by putting it into a set of Landing Tables. A Landing Table is a representation of the data source in the general form of the source. There is an equivalent table known as a Staging Table, which represents the source data, but in the format of the Target Data Model. Therefore, data needs to be transformed from the Landing Table to the Staging Table, and this happens within the MDM Hub as follows:

    1. The incoming data is run through a Delta Detection process to determine if it has changed since the last time it was processed. Only records that have changed are processed.

    2. Records are run through a staging process which transforms the data to the form of the Target Model. The staging process is a mapping within the MDM Hub which may perform any number of standardization, cleansing or transformation processes. The mappings also allow for external cleanse engines to be invoked.

3. Records are then loaded into the Staging Table. The pre-cleansed version of the records is stored in a RAW table, and records that are inappropriate to stage (for example, because they have structural deficiencies such as a duplicate PKEY) are written to a REJECT table to be manually corrected at a later time.

The data in the Staging Table is then loaded into the Base Objects. This process first applies trust scores to attributes for which trust has been defined. Trust scores represent the relative survivorship of an attribute and are calculated at the time the record is loaded, based on the currency of the data, the data source, and other characteristics of the attribute.

    Records are then pushed through a matching process which generates a set of candidates for merging. Depending on which match rules caused a record to match, the record will be queued either for automatic merging or for manual merging. Records that do not match will be loaded into the Base Object as unique records. Records queued for automatic merge will be processed by the Hub without human intervention; those queued for manual merge will be displayed to a Data Steward for further processing.

All data in the Hub is available for consumption as a batch, as a set of outbound asynchronous messages, or through a real-time services interface.

    Physical Architecture

The MDM Hub is designed as a three-tier architecture. These tiers consist of the MDM Hub Store, the MDM Hub Server(s) (including Cleanse-Match Servers) and the MDM User Interface.

The Hub Store is where business data is stored and consolidated. It contains common information about all of the databases that are part of an MDM Hub implementation and resides in a supported database server environment. The Hub Server is the run-time component that manages core and common services for the MDM Hub. It is a J2EE application, deployed on a supported application server, that orchestrates the data processing within the Hub Store as well as integration with external applications. Refer to the latest Product Availability Matrix for the versions of databases, application servers, and operating systems currently supported for the MDM Hub.

    The Hub may be implemented in either a standard architecture or in a high availability architecture. In order to achieve high availability, Informatica recommends the configuration shown below:

This configuration employs a properly sized DB server and application server(s). The DB server is configured as multiple DB cluster nodes, and the database is distributed in a SAN architecture. The application server requires sufficient file space to support efficient match batch group sizes. Refer to the MDM Sizing Guidelines to properly size each of these tiers.

Database redundancy is provided through the use of the database cluster, and application server redundancy is provided through application server clustering.

    To support geographic distribution, the HA architecture described above is replicated in a second node, with failover provided using a log replication approach. This configuration is intended to support Hot/Warm or Hot/Cold environments, but does not support Hot/Hot operation.

Leveraging PowerCenter Concurrent Workflows

Challenge

Before the introduction of PowerCenter's Concurrent Workflow feature, customers would make copies of workflows and run them under different names. This not only caused additional work, but also created maintenance issues when the workflow logic changed. With PowerCenter's Concurrent Workflow feature, it is now possible to run more than one instance of a workflow.

Description

Use Case Scenarios

    Message Queue Processing

When data is read from a message queue, the data values in the queue can be used to determine which source data to process and which targets to load the processed data into. In this scenario, different instances of the same workflow should run concurrently, with each instance receiving different connection parameters depending on the values read from the message queue. One example is a hosted data warehouse for 120 financial institutions where it is necessary to execute workflows for all the institutions in a small time frame.

    Web Services

    Different consumers of a web service need the capability to launch workflows to extract data from different external systems and integrate it with internal application data. Each instance of the workflow can accept different parameters to determine where to extract the data from and where to load the data to. For example, the Web Services Hub needs to execute multiple instances of the same web service workflow when web services requests increase.

Configuring Concurrent Workflows

One option is to run with the same instance name. When a workflow is configured to run with the same instance name, the Integration Service uses the same variables and parameters for each run. The Workflow Monitor displays the Run Id to distinguish between the workflow runs.

Informatica recommends using unique instance names, rather than the same name with different Run Id values, to implement Concurrent Workflows. The workflow can be configured to allow concurrent runs only with unique instance names, so that each instance is configured with its own name and its own parameter file. The Integration Service can persist variables for each workflow instance. When the workflow is executed, the Integration Service runs only the configured instances of the workflow.
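As a sketch, a specific instance of a concurrent workflow can be started from the command line with pmcmd, passing the instance name and a per-instance parameter file; the service, domain, user, folder, workflow, instance, and file names below are all hypothetical:

pmcmd startworkflow -sv INT_SVC -d Domain_Dev -u Administrator -p AdminPwd -f FIN_DW -rin Inst_East -paramfile /infa/params/wf_load_institution_east.txt wf_load_institution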

    Tips & Techniques

There are several tips and techniques that should be considered for the successful implementation of Concurrent Workflows. If the target is a database system, database partitioning can be used to prevent contention issues when inserting into or updating the same table. When database partitioning is used, concurrent writes to the same table are less likely to encounter deadlock issues.

Competing resources such as lookups are another concern that should be addressed when running Concurrent Workflows. Lookup caches, as well as log files, should be exclusive to each workflow instance to avoid contention.

Partitioning should also be considered. Mapping partitioning (data partitioning) is not affected by the Concurrent Workflow feature and can be used with minimal impact.

On the other hand, parameter files should be created dynamically for the dynamic concurrent workflow option. This requires the development of a methodology to generate the parameter files at run time. A database-driven option can be used, maintaining the parameters in database tables; during the execution of the Concurrent Workflows, the parameter files are then generated from the database.
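As a sketch, each row of such a control table could be written out at run time as a small parameter file for one workflow instance; the folder, workflow, session, connection, and parameter names below are all hypothetical:

[FIN_DW.WF:wf_load_institution.ST:s_m_load_institution]
$DBConnection_Source=BANK_042_ORA
$$InstitutionID=042
$PMSessionLogFile=s_m_load_institution_042.log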