8/3/2019 a Join vs Database Join
Informatica Join Vs Database Join
In yet another "DWBI Concepts' original article", we test the performance of the Informatica PowerCenter 8.5 Joiner transformation against an Oracle 10g database join. This article gives application developers crucial insight for making informed performance-tuning decisions.
Which is the fastest? Informatica or Oracle?
In our previous article, we tested the performance of the ORDER BY operation in Informatica and Oracle and found that, under our test conditions, Oracle performs sorting 14% faster than Informatica. This time we will look into the JOIN operation, not only because JOIN is the single most important data set operation but also because the performance of JOIN gives a developer crucial data for implementing proper pushdown optimization manually.

Informatica is one of the leading data integration tools in today's world. More than 4,000 enterprises worldwide rely on Informatica to access, integrate, and trust their information assets. On the other hand, Oracle is arguably the most successful and powerful RDBMS, trusted since the 1980s across all sorts of business domains and all major platforms. Both systems are among the best in the technologies they support. But when it comes to application development, developers often face the challenge of striking the right balance of operational load sharing between the two. This article will help them make an informed decision.
Which JOINs data faster? Oracle or Informatica?
As an application developer, you have the choice of either using join syntax at the database level to join your data or using a Joiner transformation in Informatica to achieve the same outcome. The question is: which system performs this faster?
Test Preparation
We will perform the same test with 4 different data points (data volumes) and log the results. We will start with 1 million rows in the detail table and 0.1 million in the master table. Subsequently we will test with 2 million, 4 million, and 6 million detail-table rows, paired with 0.2 million, 0.4 million, and 0.6 million master-table rows. Here are the details of the setup we will use:
1. Oracle 10g database as relational source and target
2. Informatica PowerCentre 8.5 as ETL tool
3. Database and Informatica setup on different physical servers using HP UNIX
4. Source database table has no constraints, no indexes, no database statistics, and no partitions
5. Source database table is not available in the Oracle shared pool before it is read
6. There is no session-level partition in Informatica PowerCenter
7. There is no parallel hint provided in the extraction SQL query
8. Informatica JOINER has enough cache size
We have used two sets of Informatica PowerCenter mappings created in PowerCenter Designer. The first mapping, m_db_side_join, uses an INNER JOIN clause in the source qualifier to join data at the database level. The second mapping, m_Infa_side_join, uses an Informatica Joiner to join data at the Informatica level. We executed these mappings at the different data points and logged the results.

Further to the above test, we executed the m_db_side_join mapping once again, this time with proper database-side indexes and statistics, and logged the results.
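To make the two strategies concrete, here is a minimal Python sketch of the contrast being measured: pushing an INNER JOIN down to the database versus joining in the ETL tool by caching the master rows and streaming the detail rows against that cache (which is essentially what a Joiner transformation does). The table and column names are invented for illustration; sqlite3 merely stands in for a relational source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE master (cust_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE detail (order_id INTEGER, cust_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO master VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
cur.executemany("INSERT INTO detail VALUES (?, ?, ?)",
                [(10, 1, 99.0), (11, 2, 45.0), (12, 1, 12.5)])

# Approach 1: database-side join, as m_db_side_join pushes into the
# source qualifier with an INNER JOIN clause.
db_joined = cur.execute(
    "SELECT d.order_id, m.name, d.amount "
    "FROM detail d INNER JOIN master m ON d.cust_id = m.cust_id "
    "ORDER BY d.order_id").fetchall()

# Approach 2: tool-side join, as a Joiner transformation would do it:
# cache the (smaller) master rows, then stream the detail rows against
# that cache.
master_cache = {cid: name for cid, name in
                cur.execute("SELECT cust_id, name FROM master")}
tool_joined = sorted(
    (oid, master_cache[cid], amt)
    for oid, cid, amt in cur.execute("SELECT order_id, cust_id, amount FROM detail")
    if cid in master_cache)
```

Both strategies must produce the same rows; what the experiment compares is only their cost profile at different data volumes.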
Result
The following graph shows the performance of Informatica and the database in terms of the time taken by each system to join data. The average time is plotted along the vertical axis and the data points along the horizontal axis.
Data Point    Master Table Record Count    Detail Table Record Count
1             0.1 M                        1 M
2             0.2 M                        2 M
3             0.4 M                        4 M
4             0.6 M                        6 M
Verdict
In our test environment, Oracle 10g performs the JOIN operation 24% faster than the Informatica Joiner transformation without an index, and 42% faster with a database index.
Assumption
1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments
Note
1. This data can only be used for performance comparison but cannot be used for performance benchmarking.
2. This data is only indicative and may vary under different testing conditions.
What is incremental aggregation?
When using incremental aggregation, you apply captured changes in the source to aggregate calculations in a session. If the source changes only incrementally and you can capture those changes, you can configure the session to process only those changes. This allows the Informatica server to update your target incrementally, rather than forcing it to process the entire source and recalculate the same calculations each time you run the session.
Comparing Performance of the SORT Operation (ORDER BY) in Informatica and Oracle
In this "DWBI Concepts' original article", we put the Oracle database and Informatica PowerCenter head to head to prove which one of them handles the data SORT operation faster. This article gives application developers crucial insight for making informed performance-tuning decisions.
Which is the fastest? Informatica or Oracle?
Informatica is one of the leading data integration tools in today's world. More than 4,000 enterprises worldwide rely on Informatica to access, integrate, and trust their information assets. On the other hand, Oracle is arguably the most successful and powerful RDBMS, trusted since the 1980s across all sorts of business domains and all major platforms. Both systems are among the best in the technologies they support. But when it comes to application development, developers often face the challenge of striking the right balance of operational load sharing between the two.

Think about a typical ETL operation often used in enterprise-level data integration. A lot of data processing can be redirected either to the database or to the ETL tool. In general, both the database and the ETL tool are reasonably capable of doing such operations with almost the same efficiency and capability. But in order to achieve optimized performance, a developer must carefully consider and decide which system to trust with each individual processing task.

In this article, we take a basic database operation, sorting, and put these two systems to the test in order to determine which does it faster than the other, if at all.
Which sorts data faster? Oracle or Informatica?
As an application developer, you have the choice of either using ORDER BY at the database level to sort your data or using a Sorter transformation in Informatica to achieve the same outcome. The question is: which system performs this faster?
Test Preparation
We will perform the same test with different data points (data volumes) and log the results. We will start with 1 million records and double the volume for each subsequent data point. Here are the details of the setup we will use:
1. Oracle 10g database as relational source and target
2. Informatica PowerCenter 8.5 as ETL tool
3. Database and Informatica set up on different physical servers running HP-UX
4. Source database table has no constraints, no indexes, no database statistics, and no partitions
5. Source database table is not available in the Oracle shared pool before it is read
6. There is no session-level partition in Informatica PowerCenter
7. There is no parallel hint provided in the extraction SQL query
8. The source table has 10 columns and the first 8 columns are used for sorting
9. The Informatica Sorter has enough cache size
We have used two sets of Informatica PowerCenter mappings created in PowerCenter Designer. The first mapping, m_db_side_sort, uses an ORDER BY clause in the source qualifier to sort data at the database level. The second mapping, m_Infa_side_sort, uses an Informatica Sorter to sort data at the Informatica level. We executed these mappings at the different data points and logged the results.
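The two mappings above contrast a database-side sort with a tool-side sort. A minimal Python sketch of that contrast, with invented table and column names (sqlite3 stands in for the relational source, and an in-memory sort stands in for the Sorter transformation):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src (c1 INTEGER, c2 INTEGER, payload TEXT)")
cur.executemany("INSERT INTO src VALUES (?, ?, ?)",
                [(2, 1, "b"), (1, 2, "a"), (1, 1, "c"), (2, 0, "d")])

# Database-side sort: the work happens inside the SQL engine,
# as with the ORDER BY in m_db_side_sort's source qualifier.
db_sorted = cur.execute("SELECT * FROM src ORDER BY c1, c2").fetchall()

# Tool-side sort: rows are read unsorted and sorted in the tool's
# memory, as a Sorter transformation would do in m_Infa_side_sort.
rows = cur.execute("SELECT * FROM src").fetchall()
tool_sorted = sorted(rows, key=lambda r: (r[0], r[1]))
```

Again, the output rows are identical; only where the CPU and memory are spent differs, which is exactly what the timing experiment measures.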
Result
The following graph shows the performance of Informatica and the database in terms of the time taken by each system to sort data. The time is plotted along the vertical axis and the data volume along the horizontal axis.
Verdict
The above experiment demonstrates that the Oracle database is faster than Informatica in the SORT operation by an average factor of 14%.
Assumption
1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments
Note
This data can only be used for performance comparison but cannot be used
for performance benchmarking.
For the Informatica and Oracle performance comparison for the JOIN operation, see the article above.
Implementing Informatica Incremental Aggregation
Using incremental aggregation, we apply captured changes in the source data (the CDC part) to aggregate calculations in a session. If the source changes incrementally and we can capture the changes, then we can configure the session to process only those changes. This allows the Integration Service to update the target incrementally, rather than forcing it to delete previously loaded data, process the entire source, and recalculate the same data each time the session runs.
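The idea above can be sketched in a few lines of Python. This is not Informatica code, just a minimal model of the principle: keep the historical aggregate and fold in only the newly captured rows instead of re-aggregating the full source. The key and value shapes are invented for illustration.

```python
def apply_increment(agg, changed_rows):
    """Fold captured changes into an existing {key: total} aggregate."""
    for key, value in changed_rows:
        agg[key] = agg.get(key, 0) + value
    return agg

# First session run aggregates the initial source rows.
totals = apply_increment({}, [("D10", 100), ("D20", 50), ("D10", 25)])

# The next run processes only the captured delta, not the entire
# source again; existing totals are updated in place.
totals = apply_increment(totals, [("D20", 10), ("D30", 5)])
```

The saving comes from the second call: its cost is proportional to the size of the change set, not the size of the full source.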
Using Informatica Normalizer Transformation
The Normalizer, a native transformation in Informatica, can ease many complex data transformation requirements. Learn how to use the Normalizer effectively here.
Using the Normalizer Transformation
A Normalizer is an active transformation that returns multiple rows from a source row; it returns duplicate data for single-occurring source columns. The Normalizer transformation parses multiple-occurring columns from COBOL sources, relational tables, or other sources. A Normalizer can be used to transpose data in columns to rows.
Normalizer effectively does the opposite of Aggregator!
Example of Data Transpose using Normalizer
Think of a relational table that stores four quarters of sales by store, where we need to create a row for each sales occurrence. We can configure a Normalizer transformation to return a separate row for each quarter, like below.
The following source rows contain four quarters of sales by store:
Source Table
Store Quarter1 Quarter2 Quarter3 Quarter4
Store1 100 300 500 700
Store2 250 450 650 850
The Normalizer returns a row for each store and sales combination. It also returns an index (GCID) that identifies the quarter number:
Target Table
Store Sales Quarter
Store 1 100 1
Store 1 300 2
Store 1 500 3
Store 1 700 4
Store 2 250 1
Store 2 450 2
Store 2 650 3
Store 2 850 4
How Informatica Normalizer Works
Suppose we have the following data in source:
Name Month Transportation House Rent Food
Sam Jan 200 1500 500
John Jan 300 1200 300
Tom Jan 300 1350 350
Sam Feb 300 1550 450
John Feb 350 1200 290
Tom Feb 350 1400 350
and we need to transform the source data and populate it as below in the target table:
Name Month Expense Type Expense
Sam Jan Transport 200
Sam Jan House rent 1500
Sam Jan Food 500
John Jan Transport 300
John Jan House rent 1200
John Jan Food 300
Tom Jan Transport 300
Tom Jan House rent 1350
Tom Jan Food 350
...and so on for the remaining rows.
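The transformation the target table illustrates can be sketched in plain Python: each source row fans out into one row per multiple-occurring column, with a GCID-style 1-based index identifying which occurrence produced the row. This is only a model of what the Normalizer does, not Informatica code; the structures are invented.

```python
# Occurrence 1, 2, 3 of the multiple-occurring expense column.
EXPENSE_TYPES = ["Transport", "House rent", "Food"]

def normalize(row):
    """Fan one wide source row out into one row per expense occurrence."""
    name, month, transport, house_rent, food = row
    values = [transport, house_rent, food]
    # gcid mirrors the GCID_EXPENSEHEAD column: it is the 1-based
    # column number of the input expense head.
    return [(name, month, EXPENSE_TYPES[gcid - 1], values[gcid - 1], gcid)
            for gcid in (1, 2, 3)]

source = [("Sam", "Jan", 200, 1500, 500),
          ("John", "Jan", 300, 1200, 300)]
target = [out for row in source for out in normalize(row)]
```

Each 5-column source row becomes three target rows, so two source rows yield six target rows, matching the transpose shown in the table above.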
Now, below is the screenshot of a complete mapping, which shows how to achieve this result using Informatica PowerCenter Designer. Image: Normalization Mapping Example 1
I will explain the mapping further below.
Setting Up Normalizer Transformation Property
First we need to set the number of occurrences property of the expense head as 3 in the Normalizer tab of the Normalizer transformation, since we have Food, House Rent, and Transportation.
This in turn will create the corresponding 3 input ports in the Ports tab, along with the fields Individual and Month.
In the Ports tab of the Normalizer, the ports will be created automatically as configured in the Normalizer tab. Interestingly, we will observe two new columns, namely GK_EXPENSEHEAD and GCID_EXPENSEHEAD. The GK field generates a sequence number starting from the value defined in the Sequence field, while GCID holds the value of the occurrence field, i.e. the column number of the input expense head. Here 1 is for FOOD, 2 is for HOUSERENT, and 3 is for TRANSPORTATION.
Now the GCID tells us which expense corresponds to which field while converting columns to rows.
Below is the screenshot of the expression to handle this GCID efficiently:
What is the difference between Normal load and Bulk load?

Load types: 1) Bulk load, 2) Normal load

Normal load:
1) Used in the case of less data.
2) We can get its log details.
3) We can roll back and commit.
4) Session recovery is possible.
5) Performance may be low.

Bulk load:
1) Used in the case of large data.
2) No log details are available.
3) We can't roll back and commit.
4) Session recovery is not possible.
5) Performance improves.
Implementing Informatica Partitions
Why use Informatica Pipeline Partition?
Identification and elimination of performance bottlenecks will obviously optimize session performance. After tuning all the mapping bottlenecks, we can further optimize session performance by increasing the number of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the system hardware while processing the session.
PowerCenter Informatica Pipeline Partition
Different Types of Informatica Partitions
We can define the following partition types: database partitioning, hash auto-keys, hash user keys, key range, pass-through, and round-robin.
Informatica Pipeline Partitioning Explained
Each mapping contains one or more pipelines. A pipeline consists of a source qualifier, all the transformations, and the target. When the Integration Service runs the session, it can achieve higher performance by partitioning the pipeline and performing the extract, transformation, and load for each partition in parallel.

A partition is a pipeline stage that executes in a single reader, transformation, or writer thread. The number of partitions in any pipeline stage equals the number of threads in that stage. By default, the Integration Service creates one partition in every pipeline stage. If we have the Informatica Partitioning option, we can configure multiple partitions for a single pipeline stage.
Setting partition attributes includes partition points, the number of partitions, and the partition types. In the session properties we can add or edit partition points. When we change partition points, we can define the partition type and add or delete partitions (the number of partitions).
We can set the following attributes to partition a pipeline:
Partition point: Partition points mark thread boundaries and divide the pipeline into stages. A stage is a section of a pipeline between any two partition points. The Integration Service redistributes rows of data at partition points. When we add a partition point, we increase the number of pipeline stages by one. Increasing the number of partitions or partition points increases the number of threads. We cannot create partition points at source instances or at Sequence Generator transformations.
Number of partitions: A partition is a pipeline stage that executes in a single thread. If we purchase the Partitioning option, we can set the number of partitions at any partition point. When we add partitions, we increase the number of processing threads, which can improve session performance. We can define up to 64 partitions at any partition point in a pipeline. When we increase or decrease the number of partitions at any partition point, the Workflow Manager increases or decreases the number of partitions at all partition points in the pipeline, so the number of partitions remains consistent throughout the pipeline. The Integration Service runs the partition threads concurrently.
Partition types: The Integration Service creates a default partition type at each partition point. If we have the Partitioning option, we can change the partition type. The partition type controls how the Integration Service distributes data among partitions at partition points. We can define the following partition types: database partitioning, hash auto-keys, hash user keys, key range, pass-through, and round-robin.

Database partitioning: The Integration Service queries the database system for table partition information. It reads partitioned data from the corresponding nodes in the database.
Pass-through: The Integration Service processes data without redistributing rows among partitions. All rows in a single partition stay in that partition after crossing a pass-through partition point. Choose pass-through partitioning when we want to create an additional pipeline stage to improve performance but do not want to change the distribution of data across partitions.
Round-robin: The Integration Service distributes data evenly among all partitions. Use round-robin partitioning where we want each partition to process approximately the same number of rows, i.e. for load balancing.
Hash auto-keys: The Integration Service uses a hash function to group rows of data among partitions. The Integration Service groups the data based on a partition key, using all grouped or sorted ports as a compound partition key. We may need to use hash auto-keys partitioning at Rank, Sorter, and unsorted Aggregator transformations.
Hash user keys: The Integration Service uses a hash function to group rows of data among partitions. We define the ports used to generate the partition key.
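The difference between round-robin and hash partitioning can be made concrete with a toy Python model (not Informatica's actual implementation; the hash function and row shapes here are invented stand-ins): round-robin spreads rows evenly, while hash partitioning guarantees that all rows sharing a key land in the same partition, which grouped operations like the Aggregator require.

```python
def round_robin(rows, n):
    """Deal rows to n partitions in turn, balancing row counts."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_partition(rows, n, key):
    """Route each row by a hash of its key: same key, same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # A stable stand-in for the service's hash function.
        bucket = sum(ord(c) for c in str(key(row))) % n
        parts[bucket].append(row)
    return parts

rows = [("D10", 1), ("D20", 2), ("D10", 3), ("D30", 4)]
rr = round_robin(rows, 2)
hp = hash_partition(rows, 2, key=lambda r: r[0])
```

With round-robin the two "D10" rows may end up in different partitions, which is fine for a Filter but wrong for a per-key aggregate; hash partitioning keeps them together at the cost of possible skew.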
Key range: The Integration Service distributes rows of data based on a port or set of ports that we define as the partition key. For each port, we define a range of values. The Integration Service uses the key and ranges to send rows to the appropriate partition. Use key range partitioning when the sources or targets in the pipeline are partitioned by key range.
We cannot create a partition key for hash auto-keys, round-robin, or pass-through partitioning.
Add, delete, or edit partition points on the Partitions view of the Mapping tab in the session properties in Workflow Manager.
The PowerCenter Partitioning option increases the performance of PowerCenter through parallel data processing. This option provides a thread-based architecture and automatic data partitioning that optimizes parallel processing on multiprocessor and grid-based hardware environments.
Implementing Informatica Persistent Cache
You must have noticed that the time Informatica takes to build the lookup cache can sometimes be too long, depending on the lookup table size/volume. Using a persistent cache, you can save a lot of this time.
What is Persistent Cache?
Lookups are cached by default in Informatica. This means that by default Informatica brings the entire data of the lookup table from the database server to the Informatica server as part of the lookup-cache-building activity during the session run. If the lookup table is huge, this can take quite some time. Now consider this scenario: what if you are looking up the same table several times, using different lookups in different mappings? Do you want to spend the time building the lookup cache again and again for each lookup? Of course not! Just use the persistent cache option. Yes, a lookup cache can be either non-persistent or persistent. The Integration Service saves or deletes lookup cache files after a successful session run based on whether the lookup cache is marked as persistent or not.
Where and when shall we use a persistent cache?

Suppose we have a lookup with the same lookup condition and return/output ports, and the lookup table is used many times in multiple mappings. Let us say a Customer Dimension table is used in many mappings to populate the surrogate key in the fact tables based on their source system
Re-cache from lookup source: To be checked, i.e. the named persistent cache file will be rebuilt or refreshed with the current data of the lookup table.
Next, in all the mappings where we want to use the same already-built named persistent cache, we need to set two properties in the Properties tab of the Lookup transformation.

Lookup cache persistent: To be checked, i.e. the lookup will use a named persistent cache that is already saved in the cache directory. If the cache file is not there, the session will not fail; it will just create the cache file instead.

Cache File Name Prefix: user_defined_cache_file_name, i.e. the named persistent cache file name that was defined in the mapping where the persistent cache file was created.
Note:
If there is any Lookup SQL Override, then the SQL statement in all the lookups should match exactly; even an extra blank space will fail the session that is using the already-built persistent cache file.
So if the incoming source data volume is high, the lookup table data volume that needs to be cached is also high, and the same lookup table is used in many mappings, then the best way to handle the situation is to use a one-time-build, already-created named persistent cache.
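The build-once, reuse-many-times behavior described above can be sketched in Python. This is a rough analogue, not Informatica code: a pickle file plays the role of the named persistent cache file, and the file name and table shapes are invented for the demo.

```python
import os
import pickle
import tempfile

cache_file = os.path.join(tempfile.gettempdir(), "cust_dim_cache.pkl")

def build_or_load_cache(read_lookup_table):
    """Load the saved cache if present, otherwise build and persist it."""
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return pickle.load(f), "reused"
    cache = dict(read_lookup_table())  # expensive full read, done once
    with open(cache_file, "wb") as f:
        pickle.dump(cache, f)
    return cache, "built"

def read_customer_dim():
    # Stand-in for reading the lookup table: natural key -> surrogate key.
    return [("C1", 1001), ("C2", 1002)]

if os.path.exists(cache_file):
    os.remove(cache_file)  # start clean for the demo
cache1, first = build_or_load_cache(read_customer_dim)   # builds the file
cache2, second = build_or_load_cache(read_customer_dim)  # reuses the file
```

The second call never touches the lookup source, which is exactly the saving a named persistent cache gives across sessions and mappings.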
Aggregation with out Informatica Aggregator
Since Informatica processes data row by row, it is generally possible to handle data aggregation even without an Aggregator transformation. In certain cases, you may get a huge performance gain using this technique!
General Idea of Aggregation without Aggregator Transformation
Let us take an example. Suppose we want to find the SUM of SALARY for each department of the employee table. The SQL query for this would be:
SELECT DEPTNO, SUM(SALARY) FROM EMP_SRC GROUP BY DEPTNO;
If we need to implement this in Informatica, it would be very easy, as we would obviously go for an Aggregator transformation. By taking the DEPTNO port as GROUP BY and one output port as SUM(SALARY), the problem can be solved easily.
Now the trick is to use only an Expression transformation to achieve the functionality of the Aggregator. We use the fundamental ability of the Expression transformation to hold the value of an attribute from the previous row.
But wait... why would we do this? Aren't we complicating the thing here?
Yes, we are. But as it turns out, in many cases this can have a performance benefit, especially if the input is already sorted or when you know the input data will not violate the order (for example, you are loading daily data and want to group it by day). Remember that Informatica holds all the rows in the Aggregator cache for the aggregation operation. This needs time and cache space, and it also voids the normal row-by-row processing in Informatica. By replacing the Aggregator with an Expression, we reduce the cache space requirement and ease row-by-row processing. The mapping below shows how to do this.
Image: Aggregation with Expression and Sorter 1
Sorter (SRT_SAL) Ports Tab
Now, I am showing a sorter here just to illustrate the concept. If you already have sorted data from the source, you need not use it, thereby increasing the performance benefit.
Expression (EXP_SAL) Ports Tab
Image: Expression Ports Tab Properties
Sorter (SRT_SAL1) Ports Tab
Expression (EXP_SAL2) Ports Tab
Filter (FIL_SAL) Properties Tab
This is how we can implement aggregation without using Informatica
aggregator transformation. Hope you liked it!
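The sorter-expression-filter pattern described above can be sketched in Python. This is a conceptual model, not Informatica code: rows are sorted by the group key, an expression-style variable carries the running value from the previous row, and only the last row of each group (which holds the completed sum) survives the filter. The function and port names are invented.

```python
def sum_salary_by_dept(rows):
    """rows: (deptno, salary) tuples; returns {deptno: total salary}."""
    rows = sorted(rows, key=lambda r: r[0])  # Sorter stage: order by DEPTNO
    out, running, prev_dept = {}, 0, None
    for deptno, salary in rows:              # Expression stage, row by row
        # Reset the running sum on a group change, else accumulate;
        # this mimics holding the previous row's value in a variable port.
        running = salary if deptno != prev_dept else running + salary
        # Overwriting per key means only the last (complete) value
        # survives, which is what the trailing Filter keeps.
        out[deptno] = running
        prev_dept = deptno
    return out

totals = sum_salary_by_dept([(10, 100), (20, 300), (10, 50), (20, 20)])
```

Because only one running value per group is held at a time, this needs constant extra memory per group instead of caching every input row, which is the performance argument made above.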
Informatica Dynamic Lookup Cache
A lookup cache does not change once built. But what if the data in the underlying lookup table changes after the lookup cache is created? Is there a way the cache can always remain up to date even if the underlying table changes?