December 01-03, 2009 •Minneapolis, Chicago, Milwaukee

December 01-03, 2009 •Minneapolis, Chicago, Milwaukee1

Best Practices to Improve Query Performance in a Data Warehouse - 1

Calisto Zuzarte, STSM, IBM, [email protected]


Data Warehouse Life Cycle

Database design / Application design– The Warehouse Application architects and Database

Administrators work together to design the queries and schema before they put the application in production

Database performance layer implementation– In order to meet SLAs, DBAs usual go through some iterations

augmenting the database with performance layer objects and set up the initial configuration to get good performance

Database tuning operations– During production, with changing requirements and change in

data, there is on-going tuning required to keep operations smooth.


Motivation

Data warehouse environments characteristics:– Large volumes of data

• Millions/Billions of rows involved in some tables

• Large Joins

• Large Sorts,

• Large Aggregations

• Many tables involved

• Large amount of data rolled-in and rolled-out

– Complex queries • Report Queries

• Ad Hoc Queries

It is important to pay attention to query performance


Objective

Provide recommendations from a DB2 optimizer perspective to improve query performance through the Data Warehouse life cycle


Agenda

SESSION 1

Best Practices – Database Design

Best Practices – Application Design

Best Practices – Configuration and Operations

SESSION 2

Best Practices – Performance Layer



Parallelism– Inter-partition Shared nothing parallelism (DPF)

– Intra-Query Parallelism (SMP)

Partitioning– Database Partitioning

– Table Partitioning • Table (Range) Partitioning

• UNION ALL Views

– Multi-Dimension Clustering

Schema


Best Practices - Parallelism

DPF or SMP or both ?

Database partition feature (DPF) is generally recommended to achieve parallelism in a data warehouse

– Achieves scalability and query performance

SMP (Intra-Query Parallelism) is NOT recommended in concurrent multi-user environments with heavy CPU usage

SMP is only recommended – When CPUs are highly under utilized and when DPF is not an

option


Partitioning (Complimentary Strategies in DB2)

Database Partitioning (DPF)– CREATE TABLE … DISTRIBUTE BY HASH– Key Benefit : Better scalability and performance through

parallelism

Table Partitioning – Table (Range) Partitioning– CREATE TABLE …PARTITION BY RANGE– Key Benefit : Better data management (roll-in and roll-out of data)

– UNION ALL Views – CREATE VIEW V AS (SELECT … FROM F1 UNION ALL … )– Key Benefit : Independent branch optimization

Multidimensional Clustering (MDC)– CREATE TABLE … ORGANIZE BY DIMENSION– Key Benefit : Better query performance through data clustering

“Database Partitioning”“Distribution Key”

“Table Partitioning”“Table Partitioning Key”

“UNION ALL branchPartitioning”

“Cells”, “Blocks”, “Dimensions”


Distribute By … Partition By … Organize By ..

CREATE TABLE …

DISTRIBUTE BY HASH

PARTITION BY RANGE

ORGANIZE BY DIMENSION

East West East West East West East West East West East West

North South North South North South North South North South North South

TS1 TS2 TS1 TS2 TS1 TS2

Jan Feb Jan Feb Jan Feb

DatabasePartition 1

DatabasePartition 2

DatabasePartition 3


Best Practices – DPF Partitioning

Collocate the fact and largest frequently joined dimension

Choose to avoid significant skew on some partitions

Avoid DATE dimension where active transactions for current date all fall on one database partition (TIMESTAMP is good)

Possibilities for workload isolation for data marts– Different partition groups but common dimension tables

– Recommend that dimension tables be replicated (discussed later)


Best Practices – Table Partitioning

Recommend partitioning the fact tables

Recommend using the DATE dimension

Works better with application key predicates applied directly

Table (Range) Partitioning– Consider partitioned indexes with V9.7

– Choose partitioning based on roll-in / roll-out granularity

UNION ALL Views– Define view predicates or CHECK Constraints to get branch

elimination with query predicates (with constants only)

– Use UNION ALL views only with well designed applications• Dangers of materialization with ad hoc queries

• Large number of branches needs time and memory to optimize


Best Practices – Multidimensional Clustering (MDC)

Recommend defining MDC on the fact table– Guaranteed clustering (Avoids the need to REORG for

clustering)

– I/O optimization

– Compact indexes (compact, coexists with regular indexes)

Choose dimensions based on query predicates – Recommend the use of 1 to 4 dimensions

– Need to ensure dimensions are chosen such that they do not waste storage

Could choose a finer granularity of Table partitioning range– For example: Table partition range by month, MDC by date


Star Schema

Product_idStore_idChannel_idDate_id

AmountQuantity…

SALESStore_id

Region_id…

STORE

Date_id

Month_idQuarter_idYear_id

TIME

Channel_id

…

CHANNEL

Product_id

Class_idGroup_idFamily_idLine_idDivision_id…

PRODUCT


Dimension Hierarchy

Channel Store

Month

Product

Class

Group

Family

Line

Division

Quarter

Year

Retailer

Sales Fact

Product Dimension

Time Dimension

Store Dimension

Channel Dimension

Level 5

Level 1

Level 2

Level 3

Level 4

Level 0Date


Best Practices - Schema

Surrogate Keys– As far as possible use application keys themselves

• allows predicates to be applied/transferred directly on the fact table

• DATE is a good candidate (easier to roll-in/roll-out and for MDC )

Star Schema / Snowflakes– Separate tables for each dimension hierarchy (snowflake) may

result in a large number of joins

– Flattened dimensions may contain a lot of redundancy (space)

Define Columns NOT NULL when appropriate– Many optimizations that are done based on NOT NULL

Define Uniqueness when appropriate– Primary Keys / Unique Constraints / Unique Indexes


Agenda

SESSION 1




SESSION 2



Application Considerations - Expressions

Use constants instead of expressions in the query– Example

• SELECT … WHERE DateCol <= CURRENT DATE – 5

• Use VALUES(CURRENT DATE – 5) to get the constant first and use it in the query

Avoid expressions on indexed columns– Example

• SELECT … WHERE DATECOL – 2 DAYS > ‘2009-10-22’

• SELECT … WHERE DATECOL > ‘2009-10-22’ + 2 DAYS

Similar recommendation with cast functions– Example

• SELECT … WHERE INT(CHARCOL) = 2009

• SELECT … WHERE CHARCOL = ‘2009’

• Note you may lose Errors/Warnings


Application Considerations – Table Partitioning / MDC

As far as possible put local predicates directly on Table Partition or MDC dimension columns of the fact table

SELECT ... FROM CUSTDIM C, TIMEDIM T, FACT F

WHERE C.country=USA and C.KEYCOL=F.CUSTKEYCOL and

T.Date = ‘2009-01-15’ and T.KEYCOL= F.TIMEKEYCOL

Simplify if the TIMEKEYCOL is correlated to the TIME values

(For example TIMEKEYCOL= 20090115 for the date ‘2009-01-15’)

SELECT ... FROM CUSTDIM C, FACT F


F.TIMEKEYCOL = 20090115


Application Considerations – Table Partitioning / MDC

Another example … considerSELECT ... FROM CUSTDIM C, TIMEDIM T, FACT F


T.YEAR = 2009 and T.KEYCOL= F.TIMEKEYCOL

First get the values for MINKEY and MAXKEY SELECT MIN(KEYCOL) FROM TIMEDIM WHERE YEAR=2009

SELECT MAX(KEYCOL) FROM TIMEDIM WHERE YEAR=2009

Then write the SQL as followsSELECT ... FROM CUSTDIM C, TIMEDIM T, FACT F


T.YEAR = 2009 and T.KEYCOL= F.TIMEKEYCOL AND

F.TIMEKEYCOL >= MINKEY AND

F.TIMEKEYCOL <= MAXKEY


Application Considerations – General Recommendations

Avoid repetitions of complex expressions

Use Global Temporary Tables to split a query if it contains more than about 15 tables and compile time is an issue


Agenda

SESSION 1




SESSION 2




Configuration– Database Configuration

– DBMS Configuration

– Registry Settings

Operations– Collecting Statistics


Configuration

Optimization Level 5

Avoid multiple bufferpools of the same page size

Configuration thumb rules – BUFFPOOL ~= SHEAPTHRES

– SORTHEAP ~= SHEAPTHRES/(# of concurrent SORT, HSJN)


Registry Variables

DB2_ANTIJOIN=EXTEND• If slow queries have NOT EXISTS, NOT IN predicates


Registry Variables

DB2_REDUCED_OPTIMIZATION=YES– Set if compile time is an issue

IBM Service may recommend a more complex setting for example:

– DB2_REDUCED_OPTIMIZATION=10,15,20,00011000…. • First Part : DB2_REDUCED_OPTIMIZATION=A,B,C

– IF more than C joins, then "quick greedy"

– ELSE IF more than B joins, then use “greedy”

– ELSE IF more than A joins, use reduced “dynamic” strategy.

• Second Part not documented (Mainly intended for setting by service)


Best Practices

Optimization Level 5BUFFERPOOL~=SHEAPTHRES

DB2_ANTIJOIN=EXTENDDB2_REDUCED_OPTIMIZATION=YES


Collecting Statistics

The DB2 Query Optimizer relies on reasonably accurate statistics to get a good query plans

User runs RUNSTATS when data changes (part of ETL)

Statistics Fabrication (unreliable)– DB2 keeps UPDATE / DELETE / INSERT counters

– Fabrication limited to a few statistics – Not enough

Automatic Statistics– Automatically collects statistics on tables in need

– Runs in the background as a low priority job

Real Time Statistics– Collects statistics on-the-fly


AUTO RUNSTATS

Set Under Automatic Table Maintenance hierarchy– AUTO_RUNSTATS cannot be ON unless

AUTO_TBL_MAINT is ON

Automatic maintenance (AUTO_MAINT) = ON

Automatic database backup (AUTO_DB_BACKUP) = OFF

Automatic table maintenance (AUTO_TBL_MAINT) = ON

Automatic runstats (AUTO_RUNSTATS) = ON

Automatic statement statistics (AUTO_STMT_STATS) = OFF

Automatic statistics profiling (AUTO_STATS_PROF) = OFF

Automatic profile updates (AUTO_PROF_UPD) = OFF

Automatic reorganization (AUTO_REORG) = OFF


REAL TIME STATISTICS

Set Under Automatic Table Maintenance hierarchy– Real Time Statistics cannot be ON unless AUTO

RUNSTATS is ON

– AUTO_RUNSTATS cannot be ON unless AUTO_TBL_MAINT is ON

Automatic maintenance (AUTO_MAINT) = ON

Automatic database backup (AUTO_DB_BACKUP) = OFF

Automatic table maintenance (AUTO_TBL_MAINT) = ON

Automatic runstats (AUTO_RUNSTATS) = ON

Automatic statement statistics (AUTO_STMT_STATS) = ON

Automatic statistics profiling (AUTO_STATS_PROF) = OFF

Automatic profile updates (AUTO_PROF_UPD) = OFF

Automatic reorganization (AUTO_REORG) = OFF


Best Practices – RUNSTATS

Distribution Statistics– Collect large Quantile Statistics for Date columns– Collect distribution statistics on columns used in predicates

Index Statistics– Do not collect DETAILED INDEX statistics . Use SAMPLED

DETAILED INDEX statistics instead

Avoid statistics on columns you know will never be used in predicates or GROUP BY columns

Use TABLESAMPLE option for very large tables and statistical views

Use RUNSTATS Profiles to store customized invocations

RUNSTATS with ATTACH ?

COMMIT immediately after RUNSTATS of each table


Collecting StatisticsAutomatic RUNSTATS

Real Time StatisticsSAMPLED DETAILED INDEX

TABLESAMPLESelective column statistic specification

Use RUNSTATS PROFILES


Summary

Tips and best practices to improve data warehouse query performance have been discussed.

– Database Design

– Application Design

– Configuration and Operations

These include key considerations related to :– Parallelism

– Partitioning

– Schema

– Application queries

– Configuration

Session 2 will cover the Performance Layer

33

© Copyright IBM Corporation [current year]. All rights reserved.U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM’S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.

Please update paragraph below for the particular product or family brand trademarks you mention such as WebSphere, DB2, Maximo, Clearcase, Lotus, etc

IBM, the IBM logo, ibm.com, [IBM Brand, if trademarked], and [IBM Product, if trademarked] are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

If you have mentioned trademarks that are not from IBM, please update and add the following lines:

[Insert any special 3rd party trademark names/attributions here] Other company, product, or service names may be trademarks or service marks of others.

Disclaimer

December 01-03, 2009 •Minneapolis, Chicago, Milwaukee

Documents

Transcript of December 01-03, 2009 •Minneapolis, Chicago, Milwaukee