8/3/2019 a Join vs Database Join
Informatica Join Vs Database Join
In yet another "DWBI Concepts' original article", we test the performance of the Informatica PowerCenter 8.5 Joiner transformation against an Oracle 10g database join. This article gives application developers crucial insight for making informed performance-tuning decisions.
Which is the fastest? Informatica or Oracle?
In our previous article, we tested the performance of the ORDER BY operation in Informatica and Oracle and found that, under our test conditions, Oracle performs sorting 14% faster than Informatica. This time we will look into the JOIN operation, not only because JOIN is the single most important data set operation but also because the performance of JOIN gives a developer crucial data for implementing proper pushdown optimization manually.

Informatica is one of the leading data integration tools in today's world. More than 4,000 enterprises worldwide rely on Informatica to access, integrate, and trust their information assets. On the other hand, Oracle is arguably the most successful and powerful RDBMS, trusted since the 1980s across all sorts of business domains and all major platforms. Both systems are among the best in the technologies they support. But when it comes to application development, developers often face the challenge of striking the right balance of operational load sharing between the two. This article will help them make an informed decision.
Which JOINs data faster? Oracle or Informatica?
As an application developer, you have the choice of either using join syntax at the database level to join your data or using a Joiner transformation in Informatica to achieve the same outcome. The question is: which system performs this faster?
Test Preparation
We will perform the same test with 4 different data points (data volumes) and log the results. We will start with 1 million rows in the detail table and 0.1 million in the master table. Subsequently we will test with 2 million, 4 million, and 6 million detail-table rows, paired with 0.2 million, 0.4 million, and 0.6 million master-table rows. Here are the details of the setup we will use:
1. Oracle 10g database as relational source and target
2. Informatica PowerCentre 8.5 as ETL tool
3. Database and Informatica setup on different physical servers using HP UNIX
4. Source database table has no constraints, no indexes, no database statistics, and no partitions
5. Source database table is not available in the Oracle shared pool before it is read
6. There is no session-level partition in Informatica PowerCenter
7. There is no parallel hint provided in the extraction SQL query
8. Informatica JOINER has enough cache size
We have used two sets of Informatica PowerCenter mappings created in PowerCenter Designer. The first mapping, m_db_side_join, uses an INNER JOIN clause in the source qualifier to join data at the database level. The second mapping, m_Infa_side_join, uses an Informatica Joiner to join data at the Informatica level. We executed these mappings at the different data points and logged the results.

Further to the above test, we executed the m_db_side_join mapping once again, this time with proper database-side indexes and statistics, and logged the results.
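To make the two strategies concrete, here is a minimal Python sketch of the contrast being measured: pushing an INNER JOIN down to the database versus joining in the ETL tool by caching the master rows and streaming the detail rows against that cache (which is essentially what a Joiner transformation does). The table and column names are invented for illustration; sqlite3 merely stands in for a relational source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE master (cust_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE detail (order_id INTEGER, cust_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO master VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
cur.executemany("INSERT INTO detail VALUES (?, ?, ?)",
                [(10, 1, 99.0), (11, 2, 45.0), (12, 1, 12.5)])

# Approach 1: database-side join, as m_db_side_join pushes into the
# source qualifier with an INNER JOIN clause.
db_joined = cur.execute(
    "SELECT d.order_id, m.name, d.amount "
    "FROM detail d INNER JOIN master m ON d.cust_id = m.cust_id "
    "ORDER BY d.order_id").fetchall()

# Approach 2: tool-side join, as a Joiner transformation would do it:
# cache the (smaller) master rows, then stream the detail rows against
# that cache.
master_cache = {cid: name for cid, name in
                cur.execute("SELECT cust_id, name FROM master")}
tool_joined = sorted(
    (oid, master_cache[cid], amt)
    for oid, cid, amt in cur.execute("SELECT order_id, cust_id, amount FROM detail")
    if cid in master_cache)
```

Both strategies must produce the same rows; what the experiment compares is only their cost profile at different data volumes.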
Result
The following graph shows the performance of Informatica and the database in terms of the time taken by each system to join data. The average time is plotted along the vertical axis and the data points along the horizontal axis.
Data Point    Master Table Record Count    Detail Table Record Count
1             0.1 M                        1 M
2             0.2 M                        2 M
3             0.4 M                        4 M
4             0.6 M                        6 M
Verdict
In our test environment, Oracle 10g performs the JOIN operation 24% faster than the Informatica Joiner transformation without an index, and 42% faster with a database index.
Assumption
1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments
Note
1. This data can only be used for performance comparison but cannot be used for performance benchmarking.
2. This data is only indicative and may vary under different testing conditions.
What is incremental aggregation?
When using incremental aggregation, you apply captured changes in the source to aggregate calculations in a session. If the source changes only incrementally and you can capture those changes, you can configure the session to process only those changes. This allows the Informatica server to update your target incrementally, rather than forcing it to process the entire source and recalculate the same calculations each time you run the session.
Comparing Performance of the SORT Operation (ORDER BY) in Informatica and Oracle
In this "DWBI Concepts' original article", we put the Oracle database and Informatica PowerCenter head to head to prove which one of them handles the data SORT operation faster. This article gives application developers crucial insight for making informed performance-tuning decisions.
Which is the fastest? Informatica or Oracle?
Informatica is one of the leading data integration tools in today's world. More than 4,000 enterprises worldwide rely on Informatica to access, integrate, and trust their information assets. On the other hand, Oracle is arguably the most successful and powerful RDBMS, trusted since the 1980s across all sorts of business domains and all major platforms. Both systems are among the best in the technologies they support. But when it comes to application development, developers often face the challenge of striking the right balance of operational load sharing between the two.

Think about a typical ETL operation often used in enterprise-level data integration. A lot of data processing can be redirected either to the database or to the ETL tool. In general, both the database and the ETL tool are reasonably capable of doing such operations with almost the same efficiency and capability. But in order to achieve optimized performance, a developer must carefully consider and decide which system to trust with each individual processing task.

In this article, we take a basic database operation, sorting, and put these two systems to the test in order to determine which does it faster than the other, if at all.
Which sorts data faster? Oracle or Informatica?
As an application developer, you have the choice of either using ORDER BY at the database level to sort your data or using a Sorter transformation in Informatica to achieve the same outcome. The question is: which system performs this faster?
Test Preparation
We will perform the same test with different data points (data volumes) and log the results. We will start with 1 million records and double the volume for each subsequent data point. Here are the details of the setup we will use:
1. Oracle 10g database as relational source and target
2. Informatica PowerCenter 8.5 as ETL tool
3. Database and Informatica set up on different physical servers running HP-UX
4. Source database table has no constraints, no indexes, no database statistics, and no partitions
5. Source database table is not available in the Oracle shared pool before it is read
6. There is no session-level partition in Informatica PowerCenter
7. There is no parallel hint provided in the extraction SQL query
8. The source table has 10 columns and the first 8 columns are used for sorting
9. The Informatica Sorter has enough cache size
We have used two sets of Informatica PowerCenter mappings created in PowerCenter Designer. The first mapping, m_db_side_sort, uses an ORDER BY clause in the source qualifier to sort data at the database level. The second mapping, m_Infa_side_sort, uses an Informatica Sorter to sort data at the Informatica level. We executed these mappings at the different data points and logged the results.
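The two mappings above contrast a database-side sort with a tool-side sort. A minimal Python sketch of that contrast, with invented table and column names (sqlite3 stands in for the relational source, and an in-memory sort stands in for the Sorter transformation):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src (c1 INTEGER, c2 INTEGER, payload TEXT)")
cur.executemany("INSERT INTO src VALUES (?, ?, ?)",
                [(2, 1, "b"), (1, 2, "a"), (1, 1, "c"), (2, 0, "d")])

# Database-side sort: the work happens inside the SQL engine,
# as with the ORDER BY in m_db_side_sort's source qualifier.
db_sorted = cur.execute("SELECT * FROM src ORDER BY c1, c2").fetchall()

# Tool-side sort: rows are read unsorted and sorted in the tool's
# memory, as a Sorter transformation would do in m_Infa_side_sort.
rows = cur.execute("SELECT * FROM src").fetchall()
tool_sorted = sorted(rows, key=lambda r: (r[0], r[1]))
```

Again, the output rows are identical; only where the CPU and memory are spent differs, which is exactly what the timing experiment measures.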
Result
The following graph shows the performance of Informatica and the database in terms of the time taken by each system to sort data. The time is plotted along the vertical axis and the data volume along the horizontal axis.
Verdict
The above experiment demonstrates that the Oracle database is faster than Informatica in the SORT operation by an average factor of 14%.
Assumption
1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments
Note
This data can only be used for performance comparison but cannot be used
for performance benchmarking.
For the Informatica and Oracle performance comparison for the JOIN operation, see the article above.
Implementing Informatica Incremental Aggregation
Using incremental aggregation, we apply captured changes in the source data (the CDC part) to aggregate calculations in a session. If the source changes incrementally and we can capture the changes, then we can configure the session to process only those changes. This allows the Integration Service to update the target incrementally, rather than forcing it to delete previously loaded data, process the entire source, and recalculate the same data each time the session runs.
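The idea above can be sketched in a few lines of Python. This is not Informatica code, just a minimal model of the principle: keep the historical aggregate and fold in only the newly captured rows instead of re-aggregating the full source. The key and value shapes are invented for illustration.

```python
def apply_increment(agg, changed_rows):
    """Fold captured changes into an existing {key: total} aggregate."""
    for key, value in changed_rows:
        agg[key] = agg.get(key, 0) + value
    return agg

# First session run aggregates the initial source rows.
totals = apply_increment({}, [("D10", 100), ("D20", 50), ("D10", 25)])

# The next run processes only the captured delta, not the entire
# source again; existing totals are updated in place.
totals = apply_increment(totals, [("D20", 10), ("D30", 5)])
```

The saving comes from the second call: its cost is proportional to the size of the change set, not the size of the full source.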
Using Informatica Normalizer Transformation
The Normalizer, a native transformation in Informatica, can ease many complex data transformation requirements. Learn how to use the Normalizer effectively here.
Using the Normalizer Transformation
A Normalizer is an active transformation that returns multiple rows from a source row; it returns duplicate data for single-occurring source columns. The Normalizer transformation parses multiple-occurring columns from COBOL sources, relational tables, or other sources. A Normalizer can be used to transpose data in columns to rows.
Normalizer effectively does the opposite of Aggregator!
Example of Data Transpose using Normalizer
Think of a relational table that stores four quarters of sales by store, where we need to create a row for each sales occurrence. We can configure a Normalizer transformation to return a separate row for each quarter, like below.
The following source rows contain four quarters of sales by store:
Source Table
Store Quarter1 Quarter2 Quarter3 Quarter4
Store1 100 300 500 700
Store2 250 450 650 850
The Normalizer returns a row for each store and sales combination. It also returns an index (GCID) that identifies the quarter number:
Target Table
Store Sales Quarter
Store 1 100 1
Store 1 300 2
Store 1 500 3
Store 1 700 4
Store 2 250 1
Store 2 450 2
Store 2 650 3
Store 2 850 4
How Informatica Normalizer Works
Suppose we have the following data in source:
Name Month Transportation House Rent Food
Sam Jan 200 1500 500
John Jan 300 1200 300
Tom Jan 300 1350 350
Sam Feb 300 1550 450
John Feb 350 1200 290
Tom Feb 350 1400 350
and we need to transform the source data and populate it as below in the target table:
Name Month Expense Type Expense
Sam Jan Transport 200
Sam Jan House rent 1500
Sam Jan Food 500
John Jan Transport 300
John Jan House rent 1200
John Jan Food 300
Tom Jan Transport 300
Tom Jan House rent 1350
Tom Jan Food 350
...and so on for the remaining rows.
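The transformation the target table illustrates can be sketched in plain Python: each source row fans out into one row per multiple-occurring column, with a GCID-style 1-based index identifying which occurrence produced the row. This is only a model of what the Normalizer does, not Informatica code; the structures are invented.

```python
# Occurrence 1, 2, 3 of the multiple-occurring expense column.
EXPENSE_TYPES = ["Transport", "House rent", "Food"]

def normalize(row):
    """Fan one wide source row out into one row per expense occurrence."""
    name, month, transport, house_rent, food = row
    values = [transport, house_rent, food]
    # gcid mirrors the GCID_EXPENSEHEAD column: it is the 1-based
    # column number of the input expense head.
    return [(name, month, EXPENSE_TYPES[gcid - 1], values[gcid - 1], gcid)
            for gcid in (1, 2, 3)]

source = [("Sam", "Jan", 200, 1500, 500),
          ("John", "Jan", 300, 1200, 300)]
target = [out for row in source for out in normalize(row)]
```

Each 5-column source row becomes three target rows, so two source rows yield six target rows, matching the transpose shown in the table above.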
Now, below is the screenshot of a complete mapping, which shows how to achieve this result using Informatica PowerCenter Designer. Image: Normalization Mapping Example 1
I will explain the mapping further below.
Setting Up Normalizer Transformation Property
First we need to set the number of occurrences property of the expense head as 3 in the Normalizer tab of the Normalizer transformation, since we have Food, House Rent, and Transportation.
This in turn will create the corresponding 3 input ports in the Ports tab, along with the fields Individual and Month.
In the Ports tab of the Normalizer, the ports will be created automatically as configured in the Normalizer tab. Interestingly, we will observe two new columns, namely GK_EXPENSEHEAD and GCID_EXPENSEHEAD. The GK field generates a sequence number starting from the value defined in the Sequence field, while GCID holds the value of the occurrence field, i.e. the column number of the input expense head. Here 1 is for FOOD, 2 is for HOUSERENT, and 3 is for TRANSPORTATION.
Now the GCID tells us which expense corresponds to which field while converting columns to rows.
Below is the screenshot of the expression to handle this GCID efficiently:
What is the difference between Normal load and Bulk load?

Load types: 1) Bulk load, 2) Normal load

Normal load:
1) Used in the case of less data.
2) We can get its log details.
3) We can roll back and commit.
4) Session recovery is possible.
5) Performance may be low.

Bulk load:
1) Used in the case of large data.
2) No log details are available.
3) We can't roll back and commit.
4) Session recovery is not possible.
5) Performance improves.
Implementing Informatica Partitions
Why use Informatica Pipeline Partition?
Identification and elimination of performance bottlenecks will obviously optimize session performance. After tuning all the mapping bottlenecks, we can further optimize session performance by increasing the number of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the system hardware while processing the session.
PowerCenter Informatica Pipeline Partition
Different Types of Informatica Partitions
We can define the following partition types: database partitioning, hash auto-keys, hash user keys, key range, pass-through, and round-robin.
Informatica Pipeline Partitioning Explained
Each mapping contains one or more pipelines. A pipeline consists of a source qualifier, all the transformations, and the target. When the Integration Service runs the session, it can achieve higher performance by partitioning the pipeline and performing the extract, transformation, and load for each partition in parallel.

A partition is a pipeline stage that executes in a single reader, transformation, or writer thread. The number of partitions in any pipeline stage equals the number of threads in that stage. By default, the Integration Service creates one partition in every pipeline stage. If we have the Informatica Partitioning option, we can configure multiple partitions for a single pipeline stage.
Setting partition attributes includes partition points, the number of partitions, and the partition types. In the session properties we can add or edit partition points. When we change partition points, we can define the partition type and add or delete partitions (the number of partitions).
We can set the following attributes to partition a pipeline:
Partition point: Partition points mark thread boundaries and divide the pipeline into stages. A stage is a section of a pipeline between any two partition points. The Integration Service redistributes rows of data at partition points. When we add a partition point, we increase the number of pipeline stages by one. Increasing the number of partitions or partition points increases the number of threads. We cannot create partition points at source instances or at Sequence Generator transformations.
Number of partitions: A partition is a pipeline stage that executes in a single thread. If we purchase the Partitioning option, we can set the number of partitions at any partition point. When we add partitions, we increase the number of processing threads, which can improve session performance. We can define up to 64 partitions at any partition point in a pipeline. When we increase or decrease the number of partitions at any partition point, the Workflow Manager increases or decreases the number of partitions at all partition points in the pipeline, so the number of partitions remains consistent throughout the pipeline. The Integration Service runs the partition threads concurrently.
Partition types: The Integration Service creates a default partition type at each partition point. If we have the Partitioning option, we can change the partition type. The partition type controls how the Integration Service distributes data among partitions at partition points. We can define the following partition types: database partitioning, hash auto-keys, hash user keys, key range, pass-through, and round-robin.

Database partitioning: The Integration Service queries the database system for table partition information. It reads partitioned data from the corresponding nodes in the database.
Pass-through: The Integration Service processes data without redistributing rows among partitions. All rows in a single partition stay in that partition after crossing a pass-through partition point. Choose pass-through partitioning when we want to create an additional pipeline stage to improve performance but do not want to change the distribution of data across partitions.
Round-robin: The Integration Service distributes data evenly among all partitions. Use round-robin partitioning where we want each partition to process approximately the same number of rows, i.e. for load balancing.
Hash auto-keys: The Integration Service uses a hash function to group rows of data among partitions. The Integration Service groups the data based on a partition key, using all grouped or sorted ports as a compound partition key. We may need to use hash auto-keys partitioning at Rank, Sorter, and unsorted Aggregator transformations.
Hash user keys: The Integration Service uses a hash function to group rows of data among partitions. We define the ports used to generate the partition key.
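The difference between round-robin and hash partitioning can be made concrete with a toy Python model (not Informatica's actual implementation; the hash function and row shapes here are invented stand-ins): round-robin spreads rows evenly, while hash partitioning guarantees that all rows sharing a key land in the same partition, which grouped operations like the Aggregator require.

```python
def round_robin(rows, n):
    """Deal rows to n partitions in turn, balancing row counts."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_partition(rows, n, key):
    """Route each row by a hash of its key: same key, same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # A stable stand-in for the service's hash function.
        bucket = sum(ord(c) for c in str(key(row))) % n
        parts[bucket].append(row)
    return parts

rows = [("D10", 1), ("D20", 2), ("D10", 3), ("D30", 4)]
rr = round_robin(rows, 2)
hp = hash_partition(rows, 2, key=lambda r: r[0])
```

With round-robin the two "D10" rows may end up in different partitions, which is fine for a Filter but wrong for a per-key aggregate; hash partitioning keeps them together at the cost of possible skew.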
Key range: The Integration Service distributes rows of data based on a port or set of ports that we define as the partition key. For each port, we define a range of values. The Integration Service uses the key and ranges to send rows to the appropriate partition. Use key range partitioning when the sources or targets in the pipeline are partitioned by key range.
We cannot create a partition key for hash auto-keys, round-robin, or pass-through partitioning.
Add, delete, or edit partition points on the Partitions view of the Mapping tab in the session properties in Workflow Manager.
The PowerCenter Partitioning option increases the performance of PowerCenter through parallel data processing. This option provides a thread-based architecture and automatic data partitioning that optimizes parallel processing on multiprocessor and grid-based hardware environments.
Implementing Informatica Persistent Cache
You must have noticed that the time Informatica takes to build the lookup cache can sometimes be too long, depending on the lookup table size/volume. Using a persistent cache, you can save a lot of this time.
What is Persistent Cache?
Lookups are cached by default in Informatica. This means that by default Informatica brings the entire data of the lookup table from the database server to the Informatica server as part of the lookup-cache-building activity during the session run. If the lookup table is huge, this can take quite some time. Now consider this scenario: what if you are looking up the same table several times, using different lookups in different mappings? Do you want to spend the time building the lookup cache again and again for each lookup? Of course not! Just use the persistent cache option. Yes, a lookup cache can be either non-persistent or persistent. The Integration Service saves or deletes lookup cache files after a successful session run based on whether the lookup cache is marked as persistent or not.
Where and when shall we use a persistent cache?

Suppose we have a lookup with the same lookup condition and return/output ports, and the lookup table is used many times in multiple mappings. Let us say a Customer Dimension table is used in many mappings to populate the surrogate key in the fact tables based on their source system
Re-cache from lookup source: To be checked, i.e. the named persistent cache file will be rebuilt or refreshed with the current data of the lookup table.
Next, in all the mappings where we want to use the same already-built named persistent cache, we need to set two properties in the Properties tab of the Lookup transformation.

Lookup cache persistent: To be checked, i.e. the lookup will use a named persistent cache that is already saved in the cache directory. If the cache file is not there, the session will not fail; it will just create the cache file instead.

Cache File Name Prefix: user_defined_cache_file_name, i.e. the named persistent cache file name that was defined in the mapping where the persistent cache file was created.
Note:
If there is any Lookup SQL Override, then the SQL statement in all the lookups should match exactly; even an extra blank space will fail the session that is using the already-built persistent cache file.
So if the incoming source data volume is high, the lookup table data volume that needs to be cached is also high, and the same lookup table is used in many mappings, then the best way to handle the situation is to use a one-time-build, already-created named persistent cache.
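The build-once, reuse-many-times behavior described above can be sketched in Python. This is a rough analogue, not Informatica code: a pickle file plays the role of the named persistent cache file, and the file name and table shapes are invented for the demo.

```python
import os
import pickle
import tempfile

cache_file = os.path.join(tempfile.gettempdir(), "cust_dim_cache.pkl")

def build_or_load_cache(read_lookup_table):
    """Load the saved cache if present, otherwise build and persist it."""
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return pickle.load(f), "reused"
    cache = dict(read_lookup_table())  # expensive full read, done once
    with open(cache_file, "wb") as f:
        pickle.dump(cache, f)
    return cache, "built"

def read_customer_dim():
    # Stand-in for reading the lookup table: natural key -> surrogate key.
    return [("C1", 1001), ("C2", 1002)]

if os.path.exists(cache_file):
    os.remove(cache_file)  # start clean for the demo
cache1, first = build_or_load_cache(read_customer_dim)   # builds the file
cache2, second = build_or_load_cache(read_customer_dim)  # reuses the file
```

The second call never touches the lookup source, which is exactly the saving a named persistent cache gives across sessions and mappings.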
Aggregation with out Informatica Aggregator
Since Informatica processes data row by row, it is generally possible to handle data aggregation even without an Aggregator transformation. In certain cases, you may get a huge performance gain using this technique!
General Idea of Aggregation without Aggregator Transformation
Let us take an example. Suppose we want to find the SUM of SALARY for each department of the employee table. The SQL query for this would be:
SELECT DEPTNO, SUM(SALARY) FROM EMP_SRC GROUP BY DEPTNO;
If we need to implement this in Informatica, it would be very easy, as we would obviously go for an Aggregator transformation. By taking the DEPTNO port as GROUP BY and one output port as SUM(SALARY), the problem can be solved easily.
Now the trick is to use only an Expression transformation to achieve the functionality of the Aggregator. We use the fundamental ability of the Expression transformation to hold the value of an attribute from the previous row.
But wait... why would we do this? Aren't we complicating the thing here?
Yes, we are. But as it turns out, in many cases this can have a performance benefit, especially if the input is already sorted or when you know the input data will not violate the order (for example, you are loading daily data and want to group it by day). Remember that Informatica holds all the rows in the Aggregator cache for the aggregation operation. This needs time and cache space, and it also voids the normal row-by-row processing in Informatica. By replacing the Aggregator with an Expression, we reduce the cache space requirement and ease row-by-row processing. The mapping below shows how to do this.
Image: Aggregation with Expression and Sorter 1
Sorter (SRT_SAL) Ports Tab
Now, I am showing a sorter here just to illustrate the concept. If you already have sorted data from the source, you need not use it, thereby increasing the performance benefit.
Expression (EXP_SAL) Ports Tab
Image: Expression Ports Tab Properties
Sorter (SRT_SAL1) Ports Tab
Expression (EXP_SAL2) Ports Tab
Filter (FIL_SAL) Properties Tab
This is how we can implement aggregation without using Informatica
aggregator transformation. Hope you liked it!
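The sorter-expression-filter pattern described above can be sketched in Python. This is a conceptual model, not Informatica code: rows are sorted by the group key, an expression-style variable carries the running value from the previous row, and only the last row of each group (which holds the completed sum) survives the filter. The function and port names are invented.

```python
def sum_salary_by_dept(rows):
    """rows: (deptno, salary) tuples; returns {deptno: total salary}."""
    rows = sorted(rows, key=lambda r: r[0])  # Sorter stage: order by DEPTNO
    out, running, prev_dept = {}, 0, None
    for deptno, salary in rows:              # Expression stage, row by row
        # Reset the running sum on a group change, else accumulate;
        # this mimics holding the previous row's value in a variable port.
        running = salary if deptno != prev_dept else running + salary
        # Overwriting per key means only the last (complete) value
        # survives, which is what the trailing Filter keeps.
        out[deptno] = running
        prev_dept = deptno
    return out

totals = sum_salary_by_dept([(10, 100), (20, 300), (10, 50), (20, 20)])
```

Because only one running value per group is held at a time, this needs constant extra memory per group instead of caching every input row, which is the performance argument made above.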
Informatica Dynamic Lookup Cache
A lookup cache does not change once built. But what if the data in the underlying lookup table changes after the lookup cache is created? Is there a way the cache can always remain up to date even if the underlying table changes?